Principal Investigator: Andrew Tridgell
Project: r32
Machine: CM

Computer Science Laboratory,
Research School of Physical Sciences and Engineering

Phoneme Recognition Using Recurrent Neural Nets

This project aims to train recurrent neural networks suitable for incorporation into a hidden Markov model phoneme recognition system. This forms part of an ongoing PhD project entitled `The training and evaluation of multiple-layer hidden Markov models for automatic phonetic transcription'.

Automatic phonetic transcription systems aim to segment and label an acoustic speech signal using a set of symbols corresponding to the phonetic elements of a language. Several techniques have been applied to this problem, including discrete hidden Markov models, semi-continuous hidden Markov models, time-delay neural networks, recurrent neural networks, and integrated approaches.

Until recently the most successful technique for this task has been the multiple-codebook discrete hidden Markov model, as demonstrated in the Sphinx system. This technique uses a statistically based learning algorithm to develop phoneme models based on a discretisation of a representation of the acoustic space.

The application of recurrent neural networks to this task has brought greatly improved recognition rates, both in terms of the segmentation and the labelling of phonetic segments. The recurrent neural network differs from conventional neural networks by introducing recurrent connections, which give the network a memory that can be used to store information about earlier portions of a signal. Recurrent networks have also been applied to a number of related tasks in speech recognition, including word segmentation and syllabification. Applications have also been found in more general pattern classification tasks.
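
To illustrate the role of the recurrent connections, the following is a minimal sketch of a forward pass through a simple recurrent network; the tanh unit, weight names and dimensions are assumptions for illustration only, not the project's actual network.

    import numpy as np

    def rnn_forward(x_seq, W_in, W_rec, h0):
        """Run a simple recurrent network over a sequence of acoustic frames.

        The hidden state h acts as the network's memory: at each frame it
        depends on the current input and, through W_rec, on the state at the
        previous frame, so information about earlier portions of the signal
        can persist.
        """
        h = h0
        states = []
        for x in x_seq:                        # one acoustic frame at a time
            h = np.tanh(W_in @ x + W_rec @ h)  # recurrent connection via W_rec
            states.append(h)
        return np.array(states)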

These networks are trained with an extension to the standard back-propagation algorithm, called `back-propagation through time'. This technique provides the first partial derivative of a sum-of-squares error, computed relative to a defined set of targets, with respect to each of the weights in the network. It is then necessary to perform some form of gradient descent on this error using any of the standard techniques. To achieve reasonable modelling of phonemes in continuous speech the number of weights in the model must be very large. This means the gradient-descent algorithm must be carefully formulated or the time taken for convergence will become unacceptable. The technique that has had the most success for this task is a statistical descent algorithm similar to simulated annealing.
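
The following sketch makes the back-propagation-through-time step concrete for a single-hidden-layer tanh recurrent network with a sum-of-squares error. The network structure, names and unit type are assumptions chosen for illustration, not the project's actual formulation.

    import numpy as np

    def bptt_gradients(x_seq, t_seq, W_in, W_rec, W_out):
        """Back-propagation through time for a simple tanh recurrent network.

        x_seq : (T, n_in) input sequence
        t_seq : (T, n_out) target sequence
        Returns the gradients of the sum-of-squares error
        E = 0.5 * sum ||y - t||^2 with respect to each weight matrix.
        """
        T = x_seq.shape[0]
        n_hid = W_rec.shape[0]

        # Forward pass: unroll the network in time, storing hidden states.
        h = np.zeros((T + 1, n_hid))           # h[0] is the initial state
        y = np.zeros_like(t_seq)
        for t in range(T):
            h[t + 1] = np.tanh(W_in @ x_seq[t] + W_rec @ h[t])
            y[t] = W_out @ h[t + 1]

        # Backward pass: propagate error derivatives back through time.
        dW_in = np.zeros_like(W_in)
        dW_rec = np.zeros_like(W_rec)
        dW_out = np.zeros_like(W_out)
        dh_next = np.zeros(n_hid)              # dE/dh carried back from later frames
        for t in reversed(range(T)):
            dy = y[t] - t_seq[t]               # derivative of 0.5*||y - t||^2
            dW_out += np.outer(dy, h[t + 1])
            dh = W_out.T @ dy + dh_next
            dz = dh * (1.0 - h[t + 1] ** 2)    # derivative of tanh
            dW_in += np.outer(dz, x_seq[t])
            dW_rec += np.outer(dz, h[t])
            dh_next = W_rec.T @ dz
        return dW_in, dW_rec, dW_out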

In this project a number of recurrent neural networks will be trained with different underlying acoustic parameters. These networks will then be combined into a multiple-layer hidden Markov model, which can weight the layers of the model on a state-by-state basis in order to optimise the recognition rate. The networks will be trained with the back-propagation-through-time and statistical descent algorithm outlined above.
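
The report does not spell out the statistical descent procedure, so the sketch below is only a speculative illustration of one simulated-annealing-style update built on the BPTT gradient: a noisy step is taken along the negative gradient and is accepted either when the error decreases or with a temperature-dependent probability.

    import numpy as np

    def statistical_descent_step(weights, grad, error_fn, step, temperature):
        """One annealing-style update on a flat weight vector (illustrative only).

        grad     : BPTT gradient of the sum-of-squares error at `weights`
        error_fn : maps a weight vector to the sum-of-squares error
        """
        rng = np.random.default_rng()
        # Propose a noisy step along the negative gradient.
        proposal = (weights - step * grad
                    + temperature * rng.standard_normal(weights.shape))
        old_err, new_err = error_fn(weights), error_fn(proposal)
        # Accept improvements outright; accept worse moves with a probability
        # that shrinks as the temperature is lowered during training.
        if new_err < old_err or rng.random() < np.exp((old_err - new_err) / temperature):
            return proposal, new_err
        return weights, old_err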

The weighting of the layers of the hidden Markov model is performed using a least squares fitting procedure to an `optimal path' defined by the labels in the speech database. This is one approach to the problem of combining multiple information sources in hidden Markov models.
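
The least-squares fit itself is not detailed in the report; one plausible reading, sketched below with assumed array shapes and names, is to solve, state by state, for layer weights that best reproduce an indicator of the labelled `optimal path'.

    import numpy as np

    def fit_layer_weights(layer_scores, optimal_path, n_states):
        """Per-state least-squares weighting of the layers of the model.

        layer_scores : (T, n_layers, n_states) score of each state from each layer
        optimal_path : (T,) state index at each frame, from the database labels
        Returns an (n_states, n_layers) array of per-state layer weights.
        """
        weights = np.zeros((n_states, layer_scores.shape[1]))
        for s in range(n_states):
            A = layer_scores[:, :, s]                  # (T, n_layers)
            b = (optimal_path == s).astype(float)      # 1.0 on the optimal path
            w, *_ = np.linalg.lstsq(A, b, rcond=None)
            weights[s] = w
        return weights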

What are the basic questions addressed?

The aim is to develop and train recurrent neural networks suitable for incorporation into a hidden Markov model phoneme recognition system. The neural networks must perform well at phoneme recognition while remaining computationally efficient to train.

What are the results to date and the future of the work?

This year saw the successful completion of this project, as far as supercomputing resources are concerned. The allocated time allowed the recurrent neural networks begun in the previous allocation period to be trained to completion.

The resulting neural networks achieve a peak of 78% phoneme recognition on the 40-phone task over the TIMIT database. This equals the best results produced thus far for this task.

What computational techniques are used and why is a supercomputer required?

This project aimed to train a number of large recurrent neural networks for automatic speech recognition. This is part of a larger project investigating micro-recognition structures in multiple-layer hidden Markov models for the automatic phonetic labelling of speech data. The recurrent neural networks provide the backbone of the recognition system, but are very expensive to train in terms of the computing resources required.