Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press, CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



The “protr” parameter (last row of the tables) was measured from the side view of the speaker’s face and expresses lip protrusion. Its use produced a very significant improvement in comprehension even though only the frontal view of the mouth is synthesized. This happens because visemes that were previously confused in the domain of frontal articulatory parameters are now discriminated by lip protrusion. The larger format (256×256) adds details that are relevant both when few parameters are used (first three rows of the tables) and when the “teeth” and “protr” parameters are added (last two rows).
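The discrimination argument above can be illustrated with a minimal sketch. The two viseme names and all parameter values below are hypothetical, chosen only to show how two mouth shapes that coincide in the frontal parameters become separable once a protrusion coordinate is added; they are not measurements from the experiments.

```python
import math

# Hypothetical articulatory vectors in arbitrary units:
# (lip width, lip height, protrusion). Values are invented for illustration.
VISEMES = {
    "rounded": (30.0, 10.0, 8.0),
    "unrounded": (30.0, 10.0, 1.0),
}


def distance(a, b, dims):
    """Euclidean distance restricted to the selected parameter indices."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))


# Frontal parameters only (width, height): the two visemes coincide.
frontal_only = distance(VISEMES["rounded"], VISEMES["unrounded"], dims=(0, 1))

# Adding the side-view protrusion parameter separates them.
with_protrusion = distance(VISEMES["rounded"], VISEMES["unrounded"], dims=(0, 1, 2))

print(frontal_only)     # 0.0
print(with_protrusion)  # 7.0
```

In this toy case the frontal distance is zero, so no classifier working on frontal parameters alone could tell the two shapes apart, whereas the protrusion coordinate gives a clear margin.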

Doubling the code-book size yields higher texture resolution in viseme reconstruction, while doubling the number of key-frames yields higher viseme discrimination. If both are increased, a significant quality improvement is achieved. In this last case, viseme reconstruction from the true articulatory parameters is almost indistinguishable from the original (88% correct recognition), while the estimation error introduced by the TDNNs lowers this score to 64% as a consequence of artifacts such as distortion of parameter amplitudes and smoothing of parameter trajectories. These impairments are particularly severe when high articulatory dynamics occur, as in plosives: the sudden closure of the lips usually affects a single frame at 25 Hz, whose parameters are too coarsely estimated by the TDNNs. Other severe impairments concern the reproduction of glottal stops, which must be discriminated from generic silence intervals in order to be associated with suitable context-dependent visemes.
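The effect of trajectory smoothing on a plosive can be sketched as follows. This is only an illustration under stated assumptions: a three-frame moving average stands in for the smoothing behaviour of the estimator (it is not a TDNN), and the lip-opening trajectory is a hypothetical sequence in arbitrary units with a closure lasting one frame.

```python
FRAME_RATE_HZ = 25  # frame rate quoted in the text; one frame lasts 40 ms


def moving_average(values, window=3):
    """Centered moving average; near the edges it averages the samples available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Hypothetical lip-opening trajectory: open, fully closed for one frame, open again.
true_opening = [9.0, 9.0, 9.0, 0.0, 9.0, 9.0, 9.0]

# Stand-in for an over-smoothed estimate of the same trajectory.
estimated = moving_average(true_opening, window=3)

closure_frame = true_opening.index(0.0)
print(estimated[closure_frame])  # 6.0
```

Because the closure occupies a single 40 ms frame, averaging it with its two open neighbours leaves the estimated lips two-thirds open, so the synthesized mouth never closes and the plosive loses its most salient visual cue.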




Copyright © CRC Press LLC