1998: J. Schalken

title:	Dutch Automatic Speech Recognition Using Kohonen Neural Networks
author:	J. Schalken
published in:	June 1998
appeared as:	Master of Science thesis Delft University of Technology
pages:	100

Abstract

For humans, speech is the fastest and easiest way to communicate. The best way to communicate with machines would therefore also be speech. For this reason, people are searching for ways to teach computers to understand us, so that everything that is said can be converted to text and commands.

For years Hidden Markov Models (HMM) have been successfully used for implementing automatic speech recognition, but lately neural network techniques are increasingly being used. One decade ago Teuvo Kohonen introduced a new neural network architecture called the Self-Organizing Map (SOM). This clustering technique can be used for speech recognition. For the input, a standard software package is used to convert the speech into a stream of features (14 mel-scale cepstrum coefficients). The idea of the SOM is that each phoneme has its own subspace in the 14-dimensional feature space. The SOM realizes a 2-dimensional projection of these 14-dimensional clusters. This projection forms a phonetic map of a language, in which each phoneme has its own cluster: a so-called phonetic typewriter. During recognition, a path is formed over this map that represents the spoken words.
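
The projection described above can be illustrated with a minimal sketch of standard SOM training and of the path a sequence of feature frames traces over the map. This is not the implementation from the thesis: the function names, the grid size, the linearly decaying learning rate and the Gaussian neighbourhood are all assumptions chosen for illustration.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dist), dist.shape))

def train_som(features, rows=8, cols=8, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 2-D SOM on feature vectors of shape (n_samples, n_dims).

    All hyperparameters are illustrative assumptions, not values from the thesis.
    """
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(rows, cols, features.shape[1]))
    # Grid coordinates of every node, used to measure distance on the map.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(features)
    step = 0
    for _ in range(epochs):
        for x in features[rng.permutation(len(features))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
            bmu = best_matching_unit(weights, x)
            d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))     # Gaussian neighbourhood on the grid
            weights += lr * h[..., None] * (x - weights)
            step += 1
    return weights

def phonetic_path(weights, frames):
    """Map each feature frame to its winning node, giving a path over the map."""
    return [best_matching_unit(weights, x) for x in frames]
```

With 14 cepstrum coefficients per frame, `features` would have shape `(n_frames, 14)`, and `phonetic_path` returns one map coordinate per frame.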

The SOM has been implemented for both the Finnish and the Japanese language, because these languages are written as they are spoken, i.e. there is a one-to-one correspondence between the phonemes and the characters. For the Dutch language, this one-to-one conversion is not possible. However, we used post-processing (HMM, vocabulary matching) to obtain a high recognition rate.

To improve the discriminating power of the SOM algorithm, a hierarchical version, the Growing SOM Tree (GSOMT), was constructed. First, one level is trained. After labeling, this level is split into new SOMs, one child SOM for each label. In the next phase of training, the master SOM is used as an input filter that routes each input to the correct child SOM. The child SOMs are trained in the traditional way. After training has finished and the nodes are labeled, the second layer can also be split if necessary. This process continues until each cluster contains only one phoneme.
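
One split step of such a hierarchy might look like the sketch below. It assumes that nodes are labeled by majority vote over the training vectors they win, and that `train_fn` is any function that turns an array of feature vectors into a weight grid of shape `(rows, cols, n_dims)`. These choices and all names are hypothetical; the abstract does not specify the labeling rule or the trainer.

```python
from collections import Counter, defaultdict

import numpy as np

def bmu(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    dist = np.linalg.norm(weights - x, axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dist), dist.shape))

def label_nodes(weights, features, labels):
    """Give each node the majority label of the training vectors it wins (assumed rule)."""
    votes = defaultdict(Counter)
    for x, y in zip(features, labels):
        votes[bmu(weights, x)][y] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}

def split_level(master, features, labels, train_fn):
    """Train one child SOM per master label, using the master SOM as input filter."""
    node_labels = label_nodes(master, features, labels)
    routed = defaultdict(list)
    for x in features:
        # The master decides which child receives this training vector.
        routed[node_labels[bmu(master, x)]].append(x)
    children = {lab: train_fn(np.array(xs)) for lab, xs in routed.items()}
    return node_labels, children

def route(master, node_labels, children, x):
    """Return (label, child node) for x, or (None, None) if the master node is unlabeled."""
    lab = node_labels.get(bmu(master, x))
    if lab is None:
        return None, None
    return lab, bmu(children[lab], x)
```

Applying `split_level` again to each child, with finer labels, would give the next layer of the tree.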

The phonetic typewriter (SOM) has been implemented for Dutch. For this purpose, a visual tool has been constructed for the PC. With this tool both the SOM and the GSOMT architecture can be tested, all parameters can be varied, and the results can be visualized. Extensive experiments show that the vowels in particular can be distinguished quite well, whereas the non-vowels can hardly be distinguished from one another and can only be separated into a few groups. At the highest level the vowels and non-vowels form two separate clusters, while at lower levels the individual phonemes are separated.

This thesis presents some performance figures for a Dutch-trained SOM and GSOMT. To improve training and recognition speed, we developed a parallel implementation on an nCUBE2. Some pointers for future research are also given. Although the power of the standard SOM, and even of the standard GSOMT, is too low for a real-world speech recognizer, these techniques can form the basis of next-generation speech recognizers. Especially good performance can be expected from hybrid speech recognition, in which SOM/GSOMT and HMM techniques are combined.