Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin CRC Press, CRC Press LLC ISBN: 0849398045 Pub Date: 11/01/98
In multimedia applications such as video-phone and video-conferencing, this intrinsic bimodality of speech is generally ignored: audio and video signals are handled separately as independent channels. The highest delivery priority is typically given to audio, to guarantee the continuity and quality of decoded speech, while far less attention is usually devoted to video. As a result, images are displayed as soon as they are decoded, without taking into account their coherence with audio. Video, in fact, has no time reference when encoded and transmitted and, therefore, no synchronization can be guaranteed at the decoder.
As a consequence, the visual-acoustic bimodality of speech is lost, and annoying artifacts appear as a perceivable incoherence between the movements of the speaker's lips and the speech. This loss of quality has a severe impact on the human perceiver, especially on listeners with hearing impairments, whose comprehension depends heavily on the ability to correlate the acoustic and visual cues of speech. Experimental results indicate that at least 15 synchronized video frames must be presented per second to guarantee successful speech reading. In other words, the acoustic cues extracted by the human hearing system from the speech input are not reliable for comprehension unless they are associated with coherent visual cues at intervals of roughly 60-70 ms (the frame period at 15 frames per second is about 67 ms). In conclusion, it can truly be said that we hear not only by ear but also by eye.
To regain the lost synchronization between the acoustic and visual modalities at the decoder, suitable post-processing must be applied under the usual constraint of real-time performance required by inter-personal communication applications such as video-telephone and video-conferencing. In other applications based on virtual character animation, there is typically no real-time requirement, and a wider choice of solutions is available.
Whatever technical solution is adopted, however, a correct audio/video re-synchronization mechanism must rely on a priori knowledge of the acoustic/visual correlations in speech. Evaluating these correlations is a hard task because they depend on the language, on the speaker, on the linguistic and phonetic context, and on the speaker's emotional state.
In applications for very low bitrate video-phone coding, synthetic images of the speaker's mouth (generated from the articulatory estimates provided by speech analysis) can be interleaved with the actual images [34-35], thus increasing the frame rate at the decoder. Software and hardware demonstrators of speech-assisted video-phone are currently under development within the European ACTS project VIDAS.²
² VIDAS, activated in 1995 with the participation of DIST as coordinating member, is a 4-year project oriented to speech-assisted video coding and representation.
In this scheme, a deformable wire-frame model is adapted to the mouth in an original frame and is then animated by means of the articulatory estimates provided by acoustic speech analysis. The animated, textured mouth is then pasted onto the original image, thus generating interpolated/extrapolated synthetic images.
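The interleaving of actual and synthetic frames can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the function name `display_schedule`, the 5 Hz rate of decoded frames, and the requirement that the display rate be an integer multiple of the decoded rate are all hypothetical and do not come from the VIDAS demonstrators. The sketch only labels each display instant; it does not model the wire-frame animation itself.

```python
def display_schedule(actual_rate_hz, display_rate_hz, duration_s):
    """List (time_s, kind) pairs for every frame shown at the decoder.

    kind is "actual" when a decoded frame exists at that instant and
    "synthetic" when the mouth would have to be generated from speech.
    Assumption: display_rate_hz is an integer multiple of actual_rate_hz.
    """
    if actual_rate_hz <= 0 or display_rate_hz % actual_rate_hz != 0:
        raise ValueError("display rate must be a positive multiple of the actual rate")
    step = display_rate_hz // actual_rate_hz  # display slots per decoded frame
    schedule = []
    for i in range(display_rate_hz * duration_s):
        kind = "actual" if i % step == 0 else "synthetic"
        schedule.append((i / display_rate_hz, kind))
    return schedule


if __name__ == "__main__":
    # Hypothetical case: 5 decoded frames per second raised to 15 displayed frames.
    for time_s, kind in display_schedule(5, 15, 1):
        print(f"{time_s:6.3f} s  {kind}")
```

With these hypothetical rates, two out of every three displayed frames are synthetic, which is how a low-rate video stream could reach the 15 frames per second cited above as the minimum for speech reading.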
The first step is recording a valid corpus for training the system which is asked to correlate the two speech modalities and, afterward, be able to estimate visual cues from pure acoustic analysis of the speech signal. The main characteristics of a valid corpus are
Given the huge amount of audio/video data to record and process, a reasonable approach is to tackle the problem first for a single speaker, thus greatly reducing the corpus size. Since continuous speech analysis is another major difficulty, it is advisable to focus first on the analysis of single separate phones, diphones, and triphones, then to move on to separate words and, only at the end, to continuous speech.
As pointed out before, a minimum rate of 15 Hz is necessary to perform any valid analysis of visual speech. A higher video time resolution is recommended, such as 25/50 Hz (with video captured through PAL cameras) or 30/60 Hz (with NTSC cameras). Still higher frame rates would help considerably, but the special professional cameras they require are often unavailable or unaffordable.
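The relation between these frame rates and the 60-70 ms interval quoted earlier is simple arithmetic: the frame period is the reciprocal of the frame rate. The short sketch below tabulates it for the rates named in the text; the function and constant names are illustrative only.

```python
def frame_period_ms(rate_hz):
    """Return the time between consecutive frames, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


# Rates named in the text: the 15 Hz minimum, PAL (25/50 Hz), NTSC (30/60 Hz).
RATES_HZ = (15, 25, 30, 50, 60)


def period_table(rates=RATES_HZ):
    """Map each frame rate to its frame period, rounded to 0.1 ms."""
    return {rate: round(frame_period_ms(rate), 1) for rate in rates}


if __name__ == "__main__":
    for rate, period in period_table().items():
        print(f"{rate:>2} Hz -> {period:5.1f} ms per frame")
```

At 15 Hz the period is about 66.7 ms, which falls inside the 60-70 ms range; PAL and NTSC capture give periods between about 16.7 ms and 40 ms, leaving a margin below that limit.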
Note that, for recording the corpus, complete video information is largely redundant: only the description of lips/tongue shape and position is needed, whereas texture and color information are of no help in the analysis and account for 99% of the memory storage requirement.
For this reason, when the image acquisition system has enough computational power, it is usually more convenient to process each video frame immediately after capture, extracting the lips/tongue parameters on the fly without storing the entire image. This is usually done through Chroma-Key techniques based on color segmentation, in which case both lips and tongue must be marked with special color make-up to aid their extraction from the image. Discarding raw images right after capture, however, rules out the later correction or integration of parameters that would be possible if the images were stored.
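A minimal sketch of color-segmentation extraction is given below. It is not the acquisition system described in the text: the blue marker color, the tolerance value, the choice of bounding-box width and height as "parameters", and all function names are assumptions made for illustration. A frame is represented as a plain list of rows of (r, g, b) tuples so that the example needs no external library.

```python
def marker_mask(image, key_rgb, tolerance):
    """Return a boolean mask, True where a pixel is close to key_rgb.

    Closeness is the largest per-channel absolute difference.
    """
    return [
        [max(abs(c - k) for c, k in zip(pixel, key_rgb)) <= tolerance
         for pixel in row]
        for row in image
    ]


def bounding_box(mask):
    """Return (left, top, right, bottom) of the True pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask)
              for x, hit in enumerate(row) if hit]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def lip_parameters(image, key_rgb=(0, 0, 255), tolerance=40):
    """Reduce one frame to the marked region's width and height in pixels.

    The key color and tolerance are hypothetical defaults for blue make-up.
    """
    box = bounding_box(marker_mask(image, key_rgb, tolerance))
    if box is None:
        return None
    left, top, right, bottom = box
    return {"width": right - left + 1, "height": bottom - top + 1}


if __name__ == "__main__":
    skin, blue = (200, 160, 140), (10, 20, 240)
    frame = [[skin] * 8 for _ in range(6)]   # 8x6 synthetic test frame
    for y in (2, 3):
        for x in range(2, 7):
            frame[y][x] = blue               # 5x2 block of marked "lip" pixels
    print(lip_parameters(frame))             # {'width': 5, 'height': 2}
```

Only the two numbers per frame are kept, which illustrates both the storage saving and the drawback noted above: once the frame is discarded, the parameters cannot be re-extracted with a different key color or tolerance.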