Fabian Nasse, Rene Grzeszick and Gernot A. Fink
Proc. 9th International Conference on Computer Vision Theory and Applications (VISAPP), 2014.
This paper presents a bottom-up approach for detecting and recognizing objects in complex scenes. In contrast to top-down methods, no prior knowledge about the objects is required. Instead, two different views on the data are computed. First, a GIST descriptor is used to cluster scenes with a similar global appearance, which produces a set of Proto-Scenes. Second, a visual attention model based on hierarchical multi-scale segmentation and feature integration is proposed. It determines Regions of Interest that are likely to contain an arbitrary object, a Proto-Object. These Proto-Object regions are then represented by a Bag-of-Features using Spatial Visual Words. The bottom-up approach makes the detection and recognition tasks more challenging, but also more efficient and easier to apply to an arbitrary set of objects. This is an important step toward analyzing complex scenes in an unsupervised manner. The bottom-up knowledge is combined with an informed system that associates Proto-Scenes with the objects that may occur in them, and an object classifier is trained to recognize the Proto-Objects. In experiments on the VOC2011 database, the proposed multi-scale visual attention model is compared with current state-of-the-art models for Proto-Object detection. Additionally, the Proto-Objects are classified with respect to the VOC object set.
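The Proto-Scene idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means with a farthest-point initialization is assumed here, random vectors stand in for real GIST descriptors, and the names `cluster_proto_scenes` and `object_priors` are hypothetical. The second function shows one simple way an informed system might associate Proto-Scenes with objects, namely by counting how often each object class occurs in the images of each cluster.

```python
import numpy as np


def cluster_proto_scenes(descriptors, k, iters=20):
    """Group global image descriptors into k clusters (assumed: Lloyd's k-means)."""
    # Farthest-point initialization: start at the first descriptor, then
    # repeatedly add the descriptor farthest from all chosen centroids.
    chosen = [descriptors[0]]
    for _ in range(1, k):
        d = np.linalg.norm(
            descriptors[:, None, :] - np.array(chosen)[None, :, :], axis=2
        ).min(axis=1)
        chosen.append(descriptors[d.argmax()])
    centroids = np.array(chosen, dtype=float)

    for _ in range(iters):
        dists = np.linalg.norm(
            descriptors[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    # Final assignment that is consistent with the returned centroids.
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def object_priors(labels, image_objects, k, num_classes):
    """Fraction of images in each Proto-Scene that contain each object class."""
    counts = np.zeros((k, num_classes))
    for label, objects in zip(labels, image_objects):
        for obj in set(objects):
            counts[label, obj] += 1
    totals = np.bincount(labels, minlength=k)[:, None]
    return counts / np.maximum(totals, 1)


# Synthetic stand-in data: two well-separated groups of 8-D "descriptors".
rng = np.random.default_rng(0)
descriptors = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 8)),
    rng.normal(5.0, 0.1, size=(10, 8)),
])
# Hypothetical annotations: object class ids present in each image.
image_objects = [[0, 2]] * 10 + [[1]] * 10

centroids, labels = cluster_proto_scenes(descriptors, k=2)
priors = object_priors(labels, image_objects, k=2, num_classes=3)
```

Each row of `priors` can be read as a rough estimate of which object classes to expect in a given Proto-Scene. In a fuller pipeline, such a prior could be used to weight the scores of the Proto-Object classifier, although how the two are combined is left open here.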