Fusion of Neural Networks, Fuzzy Systems and Genetic Algorithms: Industrial Applications
by Lakhmi C. Jain; N.M. Martin
CRC Press LLC
ISBN: 0849398045   Pub Date: 11/01/98
  



A new approach to the problem, formulated in [8] and reported in the following section, is based on a well-established technology, the Time-Delay Neural Network (TDNN). It has shown that the two computational steps of phoneme recognition and articulatory estimation, so far performed one after the other, can be merged into a single process that embeds coarticulation modeling (see Figure 9). The advantage of using the TDNN for the direct estimation of mouth articulation from acoustic speech lies in the finite memory of its neurons. Since their output represents the response to the weighted sum of a variable number of past inputs, the system can base its estimation on a suitably sized, noncausal window of the speech signal. The supervised training of this kind of system, based on a large synchronous audio/video training set, has demonstrated appreciable performance in articulatory estimation without requiring any a priori knowledge.


Figure 9  Speech is converted directly into lip movements without any intermediate stage of phoneme recognition (1063-6528/95$04.00 © 1995 IEEE).
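To make the notion of finite memory concrete, the Python sketch below computes the output of a single time-delay neuron as the activation of a weighted sum over the current and past input frames. The shapes and the tanh activation are illustrative assumptions, not the original implementation.

import numpy as np

def time_delay_neuron(x, w, b, activation=np.tanh):
    """Single time-delay neuron: at frame k the output depends on
    the D most recent input vectors x[k-D+1 .. k], i.e., on a
    finite memory of D frames.

    x : (T, F) sequence of T input frames with F features each
    w : (D, F) one weight vector per delay tap (illustrative shape)
    b : scalar bias
    """
    D = w.shape[0]
    T = x.shape[0]
    return np.array([
        activation(np.sum(x[k - D + 1:k + 1] * w) + b)
        for k in range(D - 1, T)
    ])

Stacking such neurons in layers lets the network base each articulatory estimate on a window of speech frames rather than on a single instant.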

4. The Use of Time-Delay Neural Networks for Estimating Lip Movements from Speech Analysis

4.1 The Implemented System

The speech conversion system has been implemented on an SGI Indigo 4000 XZ workstation (100 MHz, 48-bpp true color, z-buffer, double buffering, graphics accelerator). The speech signal, after being sampled at 8 kHz and linearly quantized at 16 bits, undergoes a multistage processing chain consisting of

  spectral preemphasis;
  segmentation into nonoverlapping frames of duration T = 20 ms (160 samples per frame);
  10th-order linear predictive (LPC) analysis;
  power estimation and computation of the first 12 cepstrum coefficients;
  frame normalization.

Preemphasis is obtained through a FIR filter with transfer function F(z) = 1 - az^-1 (a = 0.97). The frame duration T has been chosen equal to 20 ms so that two consecutive audio frames are associated with the same video frame (25 video frames/s). Each frame is filtered with a Hamming window to reduce spectral distortion and analyzed through the Durbin procedure to estimate the 10 LPC coefficients. By means of simple linear operations, the cepstral envelope is estimated and its first 12 coefficients are computed. The frame power (obtained directly from R(0), the zero-lag value of the estimated autocorrelation function R(τ)) is normalized to the range [Pmin, Pmax], where Pmin and Pmax are known a priori and represent the noise power and the maximum expected signal power, respectively. The 12 cepstrum coefficients of the frame are then normalized to the range [-1, 1] and finally multiplied by the normalized power. The example shown in Figure 10 makes it apparent that the normalization procedure reshapes the cepstrum coefficients according to the power envelope.
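The following Python sketch reproduces this front end step by step. The normalization bounds p_min and p_max and the exact [-1, 1] scaling of the cepstrum are placeholders, since the chapter states their roles but not their values.

import numpy as np

N = 160        # 20-ms nonoverlapping frames at 8 kHz
A = 0.97       # preemphasis coefficient
ORDER = 10     # LPC order
NCEP = 12      # number of cepstrum coefficients

def durbin(r, order):
    """Durbin recursion: autocorrelation r[0..order] -> predictor
    coefficients a[1..order] of x[n] ~ sum_k a[k] x[n-k]."""
    a = np.zeros(order + 1)
    e = r[0] + 1e-12
    for m in range(1, order + 1):
        k = (r[m] - np.dot(a[1:m], r[m - 1:0:-1])) / e
        new_a = a.copy()
        new_a[m] = k
        new_a[1:m] = a[1:m] - k * a[m - 1:0:-1]
        a = new_a
        e *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a, ncep):
    """Cepstrum of the all-pole model 1/(1 - sum_k a[k] z^-k),
    computed with the standard recursion."""
    p = len(a)
    c = np.zeros(ncep + 1)
    for n in range(1, ncep + 1):
        c[n] = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                c[n] += (k / n) * c[k] * a[n - k - 1]
    return c[1:]

def analyze(x, p_min=1e-6, p_max=1.0):
    """Front end sketch; x is the speech signal as floats in [-1, 1].
    p_min and p_max are assumed values, not from the chapter."""
    x = np.append(x[0], x[1:] - A * x[:-1])       # preemphasis F(z) = 1 - az^-1
    win = np.hamming(N)
    feats = []
    for i in range(len(x) // N):                  # nonoverlapping 20-ms frames
        f = x[i * N:(i + 1) * N] * win
        r = np.correlate(f, f, mode='full')[N - 1:]
        a = durbin(r, ORDER)                      # 10 LPC coefficients
        c = lpc_to_cepstrum(a, NCEP)              # first 12 cepstrum coefficients
        p = np.clip((r[0] - p_min) / (p_max - p_min), 0.0, 1.0)  # frame power R(0)
        c = c / (np.abs(c).max() + 1e-12)         # scale to [-1, 1] (assumed method)
        feats.append(p * c)                       # reshape by the power envelope
    return np.array(feats)

A 1-second utterance sampled at 8 kHz then yields 50 frames, i.e., a 50 x 12 matrix of normalized cepstral vectors, two per video frame.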


Figure 10  Sonogram of the Italian word “traffico” before (a) and after (b) normalization of the cepstrum coefficients (1063-6528/95$04.00 © 1995 IEEE).

The normalized 12-dimensional vector of cepstrum coefficients is then presented to the actual conversion system. As shown in Figure 11, conversion is based on a bank of time-delay neural networks (TDNNs), each trained to provide estimates of the corresponding articulatory parameters. The TDNN outputs are then smoothed and sub-sampled 1:4 so that the same configuration of articulatory parameters is associated with 4 consecutive frames. The smoothing filter is applied to stabilize the estimates, while the sub-sampling is forced by hardware constraints: the visualization system can display video frames at a maximum rate of 12.5 frames/s, corresponding to 80-ms speech segments. A synthesis program finally employs the vector of articulatory parameters to modify the wire-frame structure that models the face.
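A minimal sketch of this post-processing stage, assuming a simple moving-average smoother (the chapter does not specify the filter, and the 5-tap length is an arbitrary choice):

import numpy as np

def smooth_and_decimate(estimates, taps=5, factor=4):
    """Smooth the per-frame TDNN outputs to stabilize them, then
    keep one vector every `factor` frames (50 -> 12.5 frames/s).

    estimates : (T, P) array, one row of P articulatory
                parameters per 20-ms audio frame
    """
    kernel = np.ones(taps) / taps                 # moving-average smoother
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='same'), 0, estimates)
    return smoothed[::factor]                     # 1:4 sub-sampling

Each retained vector then drives the wire-frame face model for an 80-ms stretch of speech.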


Figure 11  Scheme of the analysis-synthesis system implemented on the Silicon Graphics workstation (1063-6528/95$04.00 © 1995 IEEE).

4.2 The Time-Delay Neural Network

Classification and functional approximation are typically static tasks in which a unique output vector is associated with any possible input. Many natural processes, however, have an intrinsic time evolution, like those related to the coordinated generation of the many body gestures involved in walking, dancing, writing, singing or, closer to the work reported in this chapter, speaking. In these cases the recognition of a particular input configuration and the definition of suitable output values must be performed through the analysis of time-correlated data; this implies the availability of suitable mechanisms for representing time dependence in the network structure and its dynamics.

There are many different ways in which a neural network can represent time information: recurrent connections can be introduced, cost functions with memory can be employed or, alternatively, suitable time delays can be used, as in the case of TDNNs. Each of these solutions has characteristics that make it better suited to some problems than to others, so the appropriate choice is critical. The task of estimating the articulatory mouth parameters from the acoustic speech waveform can be formalized as follows: given a set of pairs {x(k), d(k)}, where x(k) represents the k-th input vector to the network (whose components can be samples of the short-time spectrum envelope estimated from the k-th acoustic frame) and d(k) the corresponding target vector (whose components correspond to the articulatory parameters of the mouth measured at the same instant), a mapping F may be defined as follows:

d(k) ≈ F(x(k - N), ..., x(k), ..., x(k + N))

where the window of 2N + 1 input frames, centered on the k-th frame, reflects the finite, noncausal memory of the network described in the previous section.
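A sketch of such a mapping as a single TDNN layer, written as a one-dimensional convolution over time; the shapes, the tanh activation, and the weight layout are illustrative assumptions:

import numpy as np

def tdnn_layer(x, W, b, activation=np.tanh):
    """One TDNN layer: each output frame is a function of D
    consecutive input frames, so stacked layers give each final
    estimate a progressively wider window around frame k.

    x : (T, F_in)         input sequence
    W : (D, F_in, F_out)  one (F_in, F_out) weight matrix per delay tap
    b : (F_out,)          bias
    returns (T - D + 1, F_out)
    """
    D = W.shape[0]
    return np.array([
        activation(sum(x[k + d] @ W[d] for d in range(D)) + b)
        for k in range(x.shape[0] - D + 1)
    ])

Stacking two such layers with delay depths D1 and D2 makes each output depend on D1 + D2 - 1 consecutive input frames, which is exactly the windowed behavior the mapping F formalizes.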


