Automatic Speech Recognition Using Recurrent Neural Networks

title: Automatic Speech Recognition Using Recurrent Neural Networks
author: D. Nollen
published in: 1999
appeared as: Master of Science thesis
Delft University of Technology
pages: 116

Abstract

This thesis examines the use of Recurrent Neural Networks (RNNs) in automatic speech recognition (ASR). A great deal of research has been done in this field, and state-of-the-art techniques have already entered our everyday lives. Nonetheless, human-computer communication still falls far short of human-human interaction, so new approaches need to be explored. This report describes a modification of the Recnet phoneme recogniser developed by A. Robinson, which is based on a Recurrent Neural Network. In the Recnet ASR, postprocessing is performed by a Hidden Markov Model (HMM). The goal is to create an automatic speech recogniser that is inspired by the workings of the human brain and uses only artificial neural networks.

The postprocessing consists of two phases. The output of the Recnet phoneme recogniser contains a probability distribution for every speech sample, so each phoneme is represented by one or more probability vectors, depending on its duration. Successive samples show a lot of variation, due both to errors made by the ASR and to variation in speech. The first phase converts this stream of probability vectors into a stream of single phonemes. The basic idea is to smooth the output using context information. The conventional method for this conversion is the HMM; in this thesis an RNN is used instead. The main difficulty lies in the training data for the RNN. The HMM still outperforms the RNN, but the improvements achieved are promising.
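To make the first phase concrete, the following is a minimal sketch of the conversion problem only. It uses neither the HMM nor the RNN of the thesis; a plain moving average over neighbouring frames stands in for "smoothing using context information". The phoneme labels, the example probabilities, and the names smooth, to_phonemes, and radius are all assumptions made for illustration.

```python
from itertools import groupby

def smooth(frames, radius=1):
    """Average each probability vector with its neighbours within `radius` frames."""
    smoothed = []
    for i in range(len(frames)):
        window = frames[max(0, i - radius): i + radius + 1]
        smoothed.append([sum(col) / len(window) for col in zip(*window)])
    return smoothed

def to_phonemes(frames, labels, radius=1):
    """Pick the most probable phoneme per smoothed frame, then merge repeats."""
    best = [
        labels[max(range(len(vec)), key=vec.__getitem__)]
        for vec in smooth(frames, radius)
    ]
    return [label for label, _ in groupby(best)]

# Hypothetical recogniser output: one probability vector per speech sample.
labels = ["k", "ae", "t"]
frames = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],  # noisy frame: on its own it would be read as "k"
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]

print(to_phonemes(frames, labels, radius=0))  # no smoothing: ['k', 'ae', 'k', 'ae', 't']
print(to_phonemes(frames, labels, radius=1))  # with smoothing: ['k', 'ae', 't']
```

Merging repeated labels cannot represent a phoneme that genuinely occurs twice in a row, which is one reason a model of phoneme duration and context, such as an HMM or an RNN, is preferred over this kind of fixed window.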

The second phase segments the stream of phonemes into separate streams of phonemes that form words. This is not implemented in the Recnet ASR. A Recurrent Neural Network is used to segment the stream of phonemes. The network is based on a modification of the elementary RNN, the Elman RNN. This network was introduced by J. L. Elman and is used to predict phonemes within a phoneme stream. This prediction can determine the beginning of a new word and even detect and correct small errors. The performance of the word parser and error corrector is very good for a limited vocabulary of over 100 words.
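As a rough illustration of the second phase, the sketch below shows the forward pass of an elementary Elman network that reads one phoneme at a time and outputs a distribution over the next phoneme. It is not the modified network of the thesis. The weights are random and untrained, the softmax output layer is an assumption, and the boundary rule (flag a position when the observed phoneme was given a low predicted probability) is only one possible reading, motivated by Elman's observation that prediction error tends to rise at word onsets. The names ElmanRNN, boundary_candidates, and threshold are hypothetical.

```python
import math
import random

class ElmanRNN:
    """Untrained Elman network: the hidden layer is copied into a context layer each step."""

    def __init__(self, n_symbols, n_hidden, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = rand(n_hidden, n_symbols)   # input -> hidden
        self.w_ctx = rand(n_hidden, n_hidden)   # context -> hidden
        self.w_out = rand(n_symbols, n_hidden)  # hidden -> output
        self.context = [0.0] * n_hidden

    def step(self, symbol):
        """Consume one phoneme index and return a distribution over the next phoneme."""
        # A one-hot input selects a single column of w_in.
        hidden = [
            math.tanh(in_row[symbol] + sum(w * c for w, c in zip(ctx_row, self.context)))
            for in_row, ctx_row in zip(self.w_in, self.w_ctx)
        ]
        self.context = hidden  # copy-back: the next step sees this hidden state
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

def boundary_candidates(net, stream, threshold):
    """Return positions whose phoneme the network found unlikely given the preceding context."""
    candidates = []
    for pos in range(len(stream) - 1):
        predicted = net.step(stream[pos])
        if predicted[stream[pos + 1]] < threshold:
            candidates.append(pos + 1)
    return candidates

# Hypothetical phoneme inventory and stream ("the cat the ...").
phonemes = ["dh", "ax", "k", "ae", "t"]
stream = [phonemes.index(p) for p in ["dh", "ax", "k", "ae", "t", "dh", "ax"]]
net = ElmanRNN(n_symbols=len(phonemes), n_hidden=8)
print(boundary_candidates(net, stream, threshold=0.2))
```

With untrained weights the flagged positions carry no meaning; the network would first have to be trained on phoneme streams from the vocabulary before low-probability positions could be expected to line up with word beginnings or recognition errors.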

 