Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



2. Bimodality in Speech Production and Perception

Speech is the concatenation of elementary units, phones, generally classified as vowels if they correspond to stable configurations of the vocal tract or, alternatively, as consonants if they correspond to transient articulatory movements. Each phone is then characterized by means of a few attributes (open/closed, front/back, oral/nasal, rounded/unrounded), by its manner of articulation (fricative like /f/, /s/; plosive like /b/, /p/; nasal like /n/, /m/; ...), and by its place of articulation (labial, dental, alveolar, palatal, glottal).
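To make this taxonomy concrete, the short Python sketch below encodes a few of the phones cited above with their manner, place, and voicing attributes; the field names and the simplified feature inventory are illustrative choices, not a notation used in this chapter.

from dataclasses import dataclass

@dataclass(frozen=True)
class Phone:
    symbol: str   # phonetic symbol, e.g., "f"
    manner: str   # fricative, plosive, nasal, ...
    place: str    # labial, dental, alveolar, palatal, velar, glottal
    voiced: bool  # True if accompanied by vocal cords' vibration

# A few of the examples mentioned in the text (places simplified)
PHONES = [
    Phone("f", "fricative", "labial",   voiced=False),
    Phone("s", "fricative", "alveolar", voiced=False),
    Phone("b", "plosive",   "labial",   voiced=True),
    Phone("p", "plosive",   "labial",   voiced=False),
    Phone("n", "nasal",     "alveolar", voiced=True),
    Phone("m", "nasal",     "labial",   voiced=True),
]

# Group phones sharing the same (visible) place of articulation
by_place = {}
for ph in PHONES:
    by_place.setdefault(ph.place, []).append(ph.symbol)
print(by_place)   # {'labial': ['f', 'b', 'p', 'm'], 'alveolar': ['s', 'n']}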

Some phones, like vowels and a subset of consonants, are accompanied by vibration of the vocal cords and are called “voiced,” while other phones, like the unvoiced plosives /p/, /t/, /k/, are produced without any cord vibration and are called “unvoiced.” For voiced phones, the speech spectrum is shaped according to the geometry of the vocal tract, with characteristic energy concentrations around three main peaks called “formants,” located at increasing frequencies F1, F2, and F3.
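For readers who want to see how F1-F3 can be measured on a signal, the sketch below applies standard linear-predictive (LPC) analysis to one voiced frame: the roots of the prediction-error polynomial that lie close to the unit circle mark the spectral peaks. This is a generic technique, not a method described in this chapter; the frame length, the model order, and the synthetic test vowel are arbitrary assumptions.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def estimate_formants(frame, fs, order=12, max_bw=400.0):
    """Rough formant estimates (Hz) from one voiced frame via autocorrelation LPC."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])      # predictor coefficients
    poly = np.concatenate(([1.0], -a))                 # A(z) = 1 - sum a_k z^-k
    roots = np.roots(poly)
    roots = roots[np.imag(roots) > 0]                  # keep one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi          # pole bandwidths
    formants = sorted(f for f, b in zip(freqs, bws)
                      if 90 < f < fs / 2 - 200 and b < max_bw)
    return formants[:3]

# Synthetic vowel-like test frame: pulse train through three resonators (F, BW in Hz)
fs = 8000
excitation = np.zeros(2048)
excitation[::80] = 1.0                                 # ~100 Hz pitch
vowel = excitation
for F, B in [(500, 60), (1500, 90), (2500, 120)]:
    r_pole = np.exp(-np.pi * B / fs)
    theta = 2 * np.pi * F / fs
    vowel = lfilter([1.0], [1.0, -2 * r_pole * np.cos(theta), r_pole ** 2], vowel)

print(estimate_formants(vowel, fs))                    # approximately [500, 1500, 2500]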

An observer skilled in lipreading is able to estimate the likely locations of formant peaks by computing the transfer function from the configuration of the visible articulators. This computation is performed through the estimation of four basic parameters:

  the length of the vocal tract L;
  the distance d between the glottis and the place of maximum constriction;
  the radius r of the constriction;
  the ratio A/L between the area A of the constriction and L.

While the length L can be estimated a priori, taking into account the sex and age of the speaker, the other parameters can be inferred, roughly, from the visible configuration. If the maximum constriction is located at the mouth, thus involving lips, tongue, and teeth, as happens for labial and dental phones, this estimate is usually reliable. In contrast, when the maximum constriction is not visible, as in velar phones (/k/, /g/), the estimate is usually very poor.
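For orientation, a uniform tube of length L, closed at the glottis and open at the lips, has resonances at the odd quarter-wavelength frequencies F_n = (2n-1)c/4L; deviations from these neutral values are what the constriction parameters d, r, and A/L (only roughly recoverable from the visible articulators) would have to account for. The sketch below computes only the neutral-tube estimates; the speaker-dependent length values are illustrative assumptions, not figures from the text.

SPEED_OF_SOUND_CM_S = 35_000.0   # roughly 35,000 cm/s in warm, moist air

def neutral_formants(vocal_tract_length_cm, n=3):
    """Resonances of a uniform tube closed at the glottis and open at the lips:
    F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * SPEED_OF_SOUND_CM_S / (4 * vocal_tract_length_cm)
            for k in range(1, n + 1)]

# Illustrative tract lengths (adult male ~17.5 cm, adult female ~14.5 cm)
for label, L in [("adult male", 17.5), ("adult female", 14.5)]:
    print(label, [round(f) for f in neutral_formants(L)])
# adult male   [500, 1500, 2500]
# adult female [603, 1810, 3017]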

2.1 The Task of Lipreading Performed by Humans

Lipreading represents the highest synthesis of human expertise in converting visual inputs into words and then into meanings. It relies on a personal database of knowledge and skills, constructed and refined by training, capable of associating virtual sounds with specific mouth shapes, generally called “visemes,” and therefore of inferring the underlying acoustic message. The lipreader’s attention is basically focused on the mouth, including all its components like lips, teeth, and tongue, but significant help in comprehension also comes from the entire facial expression.

In lipreading, a significant amount of processing is performed by the lipreader himself/herself, who is skilled in post-filtering the converted message to recover from errors and from communication lags. Through linguistic and semantic reasoning it is possible to exploit the redundancy of the message and understand by context; this kind of knowledge-based interpretation is performed by the lipreader in real time.

Audio-visual speech perception and lipreading rely on two perceptual systems working in cooperation, so that, in case of hearing impairments, the visual modality can efficiently integrate or even substitute for the auditory modality. It has been demonstrated experimentally that exploiting the visual information associated with the movements of the talker’s lips improves the comprehension of speech: the benefit is equivalent to a gain of up to 15 dB in Signal-to-Noise Ratio (SNR), and auditory failure is transformed into near-perfect visual comprehension. The visual analysis of the talker’s face provides different levels of information to the observer, improving the discrimination of signal from noise. The opening/closing of the lips is, in fact, strongly correlated with the signal power and provides useful indications on how the speech stream is segmented. While vowels, on the one hand, can be recognized rather easily both through hearing and vision, consonants, conversely, are very sensitive to noise, and visual analysis often represents the only way to comprehend them successfully. The acoustic cues associated with consonants are usually characterized by low intensity, a very short duration, and fine spectral patterning.


Figure 2  Auditory confusion of consonant-vowel (CV) transitions in white noise with decreasing Signal-to-Noise Ratio, expressed in dB (From B. Dodd, R. Campbell, “Hearing by Eye: The Psychology of Lipreading,” Lawrence Erlbaum Assoc. Publ.).

The auditory confusion graph reported in Figure 2 shows that cues of nasality and voicing are efficiently discriminated through acoustic analysis, unlike place cues, which are easily distorted by noise. The opposite situation occurs in the visual domain, as shown in Figure 3, where place is recognized far more easily than voicing and nasality.

Place cues are associated, in fact, with mid-high frequencies (above 1 kHz), which are usually poorly discriminated in most hearing disorders, contrary to nasality and voicing cues, which reside in the lower part of the frequency spectrum. Cues of place, moreover, are characterized by a short-time fine spectral structure requiring high spectral and temporal resolution, unlike voicing and nasality cues, which are mostly associated with an unstructured power distribution extending over several tens of milliseconds.


Figure 3  Visual confusion of consonant-vowel (CV) transitions in white noise among adult hearing-impaired persons, with decreasing Signal-to-Noise Ratio. Consonants that are initially discriminated are progressively confused and clustered; when the 11th cluster is formed (dashed line), the resulting 9 groups of consonants can be considered distinct visemes (From B. Dodd, R. Campbell, “Hearing by Eye: The Psychology of Lipreading,” Lawrence Erlbaum Assoc. Publ.).
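The grouping procedure behind Figure 3 amounts to agglomerative clustering of a consonant confusion matrix: consonants that observers confuse most often are merged first, and cutting the resulting dendrogram yields viseme classes. The sketch below illustrates the idea on a small, entirely hypothetical confusion matrix; the numbers, the consonant set, the choice of average linkage, and the cut at three clusters are assumptions for illustration, not data from the figure.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric confusion rates (fraction of trials confused) among 6 consonants
consonants = ["p", "b", "m", "f", "t", "k"]
confusion = np.array([
    [0.00, 0.60, 0.55, 0.10, 0.05, 0.02],
    [0.60, 0.00, 0.50, 0.08, 0.04, 0.02],
    [0.55, 0.50, 0.00, 0.07, 0.03, 0.02],
    [0.10, 0.08, 0.07, 0.00, 0.06, 0.03],
    [0.05, 0.04, 0.03, 0.06, 0.00, 0.20],
    [0.02, 0.02, 0.02, 0.03, 0.20, 0.00],
])

# Convert similarity (confusability) to distance and cluster
distance = 1.0 - confusion
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance, checks=False), method="average")

# Cut the dendrogram into a chosen number of viseme-like groups (3 here, arbitrarily)
labels = fcluster(Z, t=3, criterion="maxclust")
for cluster_id in sorted(set(labels)):
    print(cluster_id, [c for c, l in zip(consonants, labels) if l == cluster_id])
# With these hypothetical numbers the bilabials /p, b, m/ fall into one group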

In any case, seeing the face of the speaker is evidently of great advantage for speech comprehension and is almost necessary in the presence of noise or hearing impairments; vision directs the listener’s attention, adds redundancy to the signal, and provides evidence of those cues which would otherwise be irreversibly masked by noise.



Copyright © CRC Press LLC