Automatic Feature Learning for Optical Character Recognition and Speech Recognition


Principal Investigator

Jonathan Baxter

Systems Engineering

Research School of Information Sciences and Engineering

Classical approaches to statistical pattern recognition require that the statistician first select a set of appropriate features for the problem. This feature-selection process is informal, heuristic, and widely regarded as a "black art". Moreover, the quality of the chosen features fundamentally constrains the performance of any pattern recognition device built on them. An obvious goal, therefore, is to find ways of learning the features automatically.

The Principal Investigator (Jonathan Baxter) has recently developed a mathematical framework for feature learning and has proven that if a learner is embedded within an environment of related learning tasks, then it is possible to learn features that are appropriate for the entire environment. Furthermore, the larger the number of tasks, the more accurate the learnt features will be. In addition to these theoretical results, an algorithm for learning features with artificial neural networks has also been developed.

The purpose of this project is to experimentally verify the theoretical results in two domains: Japanese optical character recognition and spoken word recognition. These domains are ideal tests for the theory because they consist of a large number of related tasks. In Kanji OCR each character can be regarded as an individual learning problem, and there are thousands of different characters; in speech recognition, each spoken word can likewise be viewed as an individual learning problem.






What are the results to date and the future of the work?

I have run a series of preliminary experiments in which a neural-network feature map was learnt using an artificial training set generated from an X-windows font. The experiments were very successful; a representative example: a neural network was trained to an error of only 1.5% on 50,000 examples of 1000 Kanji characters. The feature part of the network was then extracted and used to learn 1000 novel characters, again with an error of only 1.5%. This experiment may well represent the first time a feature set learnt on one set of tasks has been used successfully to learn novel tasks from the same domain. The experiment also verified another aspect of the theory: increasing the number of training tasks decreases the number of examples of each task required for good generalisation. The number of examples required per task for good generalisation fell from 2000 when only one character was learnt to 50 when 1000 characters were learnt.
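This scaling agrees qualitatively with the sample-complexity bounds in the underlying framework. Informally, with constants and confidence terms omitted and with symbols of my own choosing, if $n$ related tasks share a common feature map, the number of examples required per task behaves like

```latex
m \;=\; O\!\left(a + \frac{b}{n}\right),
```

where $b$ reflects the complexity of the shared feature map, $a$ the complexity of a single task-specific output layer, and $n$ the number of tasks learnt simultaneously: the cost of learning the features is amortised across the tasks, so the per-task requirement falls towards $a$ as $n$ grows.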

In November 1996 a database of 180,000 scanned and segmented images of Japanese characters was purchased from a research group in the U.S. (CEDAR). The code was modified to accommodate the new input format and a further series of experiments was run. These simulations revealed a problem with the database: many characters have very low representation (some have only one example). This causes problems for feature learning, because a minimum number of examples of each character is required to ensure good generalisation. With this in mind, a further set of experiments was run in which only those characters with more than fifty examples in the database were learnt (there are 643 such characters). The features learnt have not yet been tested for learning novel characters, but the network itself exhibited extremely good generalisation performance: the error on an independent test set was just 5.5%. It is worth noting that this is below the 6% error rate achieved by the group who sold us the data (CEDAR), who have been tuning standard nearest-neighbour techniques for several years.
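The selection step is simple to state in code. The sketch below is mine, not the project's (class count, names, and the exact cutoff comparison are illustrative): tally the examples per character class, then keep only the classes whose count clears the cutoff.

```c
/* Illustrative sizes; the real database has thousands of character
   classes and 180,000 examples in total. */
#define N_CLASSES 3036
#define CUTOFF    50   /* "more than fifty examples" */

static int n_examples[N_CLASSES];   /* examples per character class */

/* Tally the class label of each example in the database. */
static void tally(const int *labels, int n) {
    for (int i = 0; i < n; i++)
        n_examples[labels[i]]++;
}

/* Keep only the classes with enough examples for reliable feature
   learning; returns the number of classes that survive the cut. */
static int select_classes(int keep[N_CLASSES]) {
    int kept = 0;
    for (int c = 0; c < N_CLASSES; c++) {
        keep[c] = (n_examples[c] > CUTOFF);
        kept += keep[c];
    }
    return kept;
}
```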

There are five immediate tasks for the next phase of the project:

1: Develop a suitable noise model so artificial data can be generated to overcome the problem of underrepresentation of some characters.

2: Train features on a larger subset of Kanji characters and test their classification performance on novel characters.

3: Analyse the feature set to determine what it is computing. This will provide strong hints for how to improve the preprocessing used in classification.

4: Further optimise the performance of the classifier by introducing a hybrid nearest-neighbour/neural-network technique (a crude version of this has already been tested and decreased the error rate by 2%).

5: Develop a classifier for all Japanese characters (Kanji, Hiragana, Katakana and Romaji characters). As a first step, the features learnt on the Kanji characters will be used as a preprocessing stage for the other kinds of characters. It will be interesting to observe to what extent the features need to be changed to work well in these essentially new domains.
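For task 1, one crude starting point for a noise model is to render a clean bitmap from the font and corrupt it with independent pixel flips. The sketch below is a placeholder of mine, standing in for whatever scanner-noise model is eventually fitted to the real data; the image dimensions and the flip model are assumptions.

```c
#include <stdlib.h>

#define W 64   /* assumed bitmap width  */
#define H 64   /* assumed bitmap height */

/* Flip each pixel of a clean font-rendered binary image independently
   with probability p.  A real noise model would also fit p and the
   spatial correlations of scanner noise from the scanned database. */
static void add_noise(unsigned char img[H][W], double p) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            /* rand()/(RAND_MAX+1.0) lies in [0,1), so p=1 always flips */
            if ((double)rand() / ((double)RAND_MAX + 1.0) < p)
                img[y][x] = !img[y][x];
}
```

Applying this to each clean rendering many times yields arbitrarily many artificial examples per underrepresented character.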

All of this work is further developing our understanding of the practical issues involved in feature learning. Concurrently with the Japanese OCR work, in March 1997 we will begin a preliminary investigation of the use of feature learning techniques for speech recognition.

What computational techniques are used?

All programs are written in "C", using the "#pragma parallel" extensions available on the PC to control parallel loops. The neural networks are trained by performing conjugate gradient descent with a heuristic line search of my own devising that requires on average only 1.5 forward passes and one backward pass through the network per line minimisation. The forward and backward passes are the execution bottleneck, but by partitioning the training set and some tuning of the parallel code I have been able to achieve a near-linear speedup in code execution (that is, on 8 processors my program runs about 7.8 times faster than it does on 1 processor).
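The training-set partitioning can be sketched in plain C. Everything here is illustrative (sizes, the stand-in per-example gradient, and the partition layout are my assumptions); the real code wraps the partition loop in the machine's parallel-loop directives, indicated below by a comment.

```c
#define N_WEIGHTS  100
#define N_PROCS    8
#define N_EXAMPLES 800   /* assumed divisible by N_PROCS */

static double grad_part[N_PROCS][N_WEIGHTS];   /* one partial gradient per processor */
static double grad[N_WEIGHTS];                 /* full gradient after reduction      */

/* Stand-in for one forward + backward pass: accumulates example n's
   gradient contribution into g. */
static void example_grad(int n, double g[N_WEIGHTS]) {
    for (int w = 0; w < N_WEIGHTS; w++)
        g[w] += (double)(n % 7) * 0.001;
}

void accumulate_gradient(void) {
    /* Parallel region: each processor p owns one contiguous partition of
       the training set and writes only to its own partial gradient. */
    for (int p = 0; p < N_PROCS; p++) {
        int lo = p * (N_EXAMPLES / N_PROCS);
        int hi = lo + N_EXAMPLES / N_PROCS;
        for (int w = 0; w < N_WEIGHTS; w++)
            grad_part[p][w] = 0.0;
        for (int n = lo; n < hi; n++)
            example_grad(n, grad_part[p]);
    }
    /* Serial reduction: sum the partial gradients. */
    for (int w = 0; w < N_WEIGHTS; w++) {
        grad[w] = 0.0;
        for (int p = 0; p < N_PROCS; p++)
            grad[w] += grad_part[p][w];
    }
}
```

Because the partitions share no writable state, the partition loop parallelises without locks, which is consistent with the near-linear speedup reported above: only the cheap reduction runs serially.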

- Appendix A