title: Analysis and Recording of Multimodal Data
author: Mathijs van Vulpen
published: September 2008
Master of Science thesis
Man-machine interaction group
Delft University of Technology
PDF (7123 KB)
Abstract

Emotions are part of our lives and can enhance the meaning of our communication. Communication with computers, however, is still done by keyboard and mouse, and this form of human-computer interaction leaves no room for emotions. If we could communicate with machines the way we do in face-to-face conversation, much information could be extracted from the context and the emotional state of the speaker. We propose a protocol for the construction of a multimodal database, together with a prototype that can be trained on this database for multimodal emotion recognition.
The multimodal database consists of audio and video clips for lip reading, speech analysis, vocal affect recognition, facial expression recognition and multimodal emotion recognition. We recorded these clips in a controlled environment. The database is intended as a benchmark for current and future emotion recognition studies, so that results from different research groups can be compared.
Validation of the recorded data is done online. Over 60 users scored the apex images (1,272 ratings), audio clips (201 ratings) and video clips (503 ratings) on the valence and arousal scales. Textual validation is based on Whissell's Dictionary of Affect in Language. Comparing the scores of all four validation methods shows clear clusters for distinct emotions, but also scatter for emotions that depend mainly on context, and context is not always available.
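As a minimal sketch of this kind of validation analysis, the per-emotion ratings can be aggregated into a centroid on the valence-arousal plane; tight centroids with little spread indicate the clusters mentioned above. The emotion labels, rating values and the -1..1 scale below are illustrative assumptions, not data from the thesis.

```python
# Hypothetical valence-arousal aggregation; values are illustrative only.
from statistics import mean

# Each rating is (valence, arousal), assumed here to lie on a -1..1 scale.
ratings = {
    "happy":    [(0.8, 0.6), (0.7, 0.5), (0.9, 0.7)],
    "sad":      [(-0.6, -0.4), (-0.7, -0.5)],
    "surprise": [(0.1, 0.9), (-0.2, 0.8)],  # scattered valence: sign depends on context
}

def centroid(points):
    """Mean valence and arousal over one emotion's ratings."""
    valences, arousals = zip(*points)
    return (mean(valences), mean(arousals))

for emotion, points in ratings.items():
    v, a = centroid(points)
    print(f"{emotion:8s} valence={v:+.2f} arousal={a:+.2f}")
```

A context-dependent emotion like surprise ends up with a centroid near zero valence but a wide spread of individual ratings, which is exactly the scatter the comparison revealed.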
We created a prototype that can extract and track facial feature points, based on the system of Anna Wojdel. The prototype is implemented in Matlab and is able to separate the audio from the video clip, extract frames, and apply five different classifiers to the audio and video streams separately. For the auditory channel we trained three classifiers: one for all 21 emotions, one for positive versus negative emotions and one for active versus passive emotions. For the visual channel we trained two classifiers: one based on the detected facial feature points and one based on AU (Action Unit) activation. The classification results of our prototype are promising, considering that we distinguish 21 different emotions and trained the auditory classifiers on only two persons and the visual classifiers on one person. Better results could be achieved with access to more samples from various people. The average classification rates for the three auditory classifiers are 38%, 36% and 59% respectively, and for the two visual classifiers 2% and 0% respectively.
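The three auditory label granularities above (all 21 emotions, positive/negative, active/passive) can be sketched as one prediction set scored against three label mappings. The emotion-to-group mapping below is an assumed example for illustration; it is not the thesis's actual emotion list, and the predictions are invented, not prototype output.

```python
# Illustrative label granularities; the mappings and predictions are assumptions.
VALENCE = {"happy": "positive", "sad": "negative", "angry": "negative", "content": "positive"}
AROUSAL = {"happy": "active",   "sad": "passive",  "angry": "active",   "content": "passive"}

def accuracy(true_labels, predicted):
    """Fraction of matching labels, i.e. the average classification rate."""
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

true_labels = ["happy", "sad", "angry", "content"]
predicted   = ["happy", "angry", "angry", "sad"]  # hypothetical classifier output

print(accuracy(true_labels, predicted))                                # full labels
print(accuracy([VALENCE[t] for t in true_labels],
               [VALENCE[p] for p in predicted]))                       # positive/negative
print(accuracy([AROUSAL[t] for t in true_labels],
               [AROUSAL[p] for p in predicted]))                       # active/passive
```

Collapsing fine-grained labels into coarser groups forgives confusions within a group (here, predicting "angry" for "sad" still counts as a correct "negative"), which is one reason a two-class rate such as the 59% active/passive result can exceed the 21-class rate.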