Visual Scene Interpretation as a Dialogue between Vision and Language

Title: Visual Scene Interpretation as a Dialogue between Vision and Language
Publication Type: Conference Papers
Year of Publication: 2011
Authors: Yu X, Fermüller C, Aloimonos Y
Conference Name: Workshops at the Twenty-Fifth AAAI Conference on Artificial Intelligence
Date Published: 2011/08/24

We present a framework for semantic visual scene interpretation in a system with vision and language. In this framework the system consists of two modules, a language module and a vision module, which communicate with each other in the form of a dialogue to actively interpret the scene. The language module is responsible for obtaining domain knowledge from linguistic resources and for reasoning on the basis of this knowledge and the visual input. It iteratively creates questions that amount to an attention mechanism for the vision module, which in turn shifts its focus to selected parts of the scene and applies selective segmentation and feature extraction. As a formalism for optimizing this dialogue we use information theory. We demonstrate the framework on the problem of recognizing a static scene from its objects and show preliminary results for the problem of human activity recognition from video. Experiments demonstrate the effectiveness of the active paradigm in introducing attention and additional constraints into the sensing process.
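The abstract does not spell out the information-theoretic formalism, but the idea of letting the language module pick its next question can be illustrated with a standard expected-information-gain criterion: ask about the object whose presence/absence answer most reduces the entropy of the scene distribution. The following is a minimal sketch, not the authors' implementation; the scene priors and object likelihoods are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical scene hypotheses with prior probabilities.
scene_priors = {"kitchen": 0.5, "office": 0.3, "bedroom": 0.2}

# Hypothetical likelihoods P(object present | scene).
likelihood = {
    "mug":    {"kitchen": 0.9,  "office": 0.7,  "bedroom": 0.1},
    "stove":  {"kitchen": 0.8,  "office": 0.05, "bedroom": 0.02},
    "pillow": {"kitchen": 0.05, "office": 0.1,  "bedroom": 0.9},
}

def expected_information_gain(obj):
    """Expected reduction in scene entropy from asking whether obj is present."""
    h_prior = entropy(scene_priors.values())
    gain = 0.0
    for present in (True, False):
        # P(answer) = sum over scenes of P(answer | scene) * P(scene).
        p_answer = sum(
            (likelihood[obj][s] if present else 1 - likelihood[obj][s]) * p
            for s, p in scene_priors.items()
        )
        if p_answer == 0:
            continue
        # Posterior over scenes given the answer (Bayes' rule).
        posterior = [
            (likelihood[obj][s] if present else 1 - likelihood[obj][s]) * p / p_answer
            for s, p in scene_priors.items()
        ]
        gain += p_answer * (h_prior - entropy(posterior))
    return gain

# The language module would direct the vision module's attention to the
# most informative object query first.
best_question = max(likelihood, key=expected_information_gain)
```

In this toy example, asking about the stove is most informative because its presence sharply discriminates the kitchen hypothesis from the others; a real system would recompute the gains after each answer and iterate, which mirrors the dialogue loop the abstract describes.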