Abstract
One approach to representing images is as a bag-of-regions vector, but this
representation discards potentially useful information about the spatial and
semantic relationships between the parts of the image. The central argument of
the research is that capturing and encoding the relationships between parts of
an image will improve the performance of downstream tasks. A simplifying
assumption throughout the talk is that we have access to gold-standard object
annotations.
The first part of this talk will focus on the Visual Dependency Representation:
a novel structured representation that captures region-region relationships in
an image. The key idea is that images depicting the same events are likely to
have similar spatial relationships between the regions contributing to the
event. We explain how to automatically predict Visual Dependency
Representations using a modified graph-based statistical dependency parser. Our
approach can exploit features from the region annotations and the description
to predict the relationships between objects in an image.
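To make the representation concrete, here is a minimal sketch of how a Visual
Dependency Representation could be encoded as a labelled directed graph over
annotated regions, with arcs carrying spatial relations; the region names,
relation labels, and class layout below are illustrative assumptions rather
than the implementation described in the talk.

    from dataclasses import dataclass, field

    @dataclass
    class Region:
        label: str    # gold-standard object annotation, e.g. "man"
        bbox: tuple   # (x, y, width, height) in pixels

    @dataclass
    class VDR:
        regions: list                              # nodes: annotated regions
        arcs: list = field(default_factory=list)   # (head_idx, dep_idx, relation)

        def attach(self, head_idx, dep_idx, relation):
            """Attach region dep_idx to region head_idx with a spatial relation."""
            self.arcs.append((head_idx, dep_idx, relation))

    # Toy example for "A man rides a bike on the road."
    regions = [Region("ROOT", (0, 0, 0, 0)),          # artificial root node
               Region("man",  (120, 40, 80, 160)),
               Region("bike", (110, 120, 100, 90)),
               Region("road", (0, 180, 400, 120))]

    vdr = VDR(regions)
    vdr.attach(0, 1, "root")    # the man heads the depicted event
    vdr.attach(1, 2, "above")   # the man is above the bike
    vdr.attach(2, 3, "on")      # the bike is on the road

A graph-based dependency parser can then score candidate arcs between regions
and recover the highest-scoring tree, much as such parsers recover syntactic
trees over the words of a sentence.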
The second part of the talk will show that adopting Visual Dependency
Representations of images leads to significant improvements on two downstream
tasks. In an image description task, we find improvements compared to
state-of-the-art models that use either external text corpora or region
proximity to guide the generation process. Finally, in a query-by-example
image retrieval task, we show improvements in Mean Average Precision and the
precision of the top 10 images compared to a bag-of-terms approach.
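For reference, the sketch below shows how the two retrieval metrics reported
here are typically computed from a ranked list of retrieved images and the set
of images relevant to a query; the function names and data layout are
assumptions for illustration, not the evaluation code used in the talk.

    def precision_at_k(ranked_ids, relevant_ids, k=10):
        """Fraction of the top-k retrieved images that are relevant."""
        return sum(1 for r in ranked_ids[:k] if r in relevant_ids) / k

    def average_precision(ranked_ids, relevant_ids):
        """Mean of the precision values at the ranks where relevant images occur."""
        hits, precisions = 0, []
        for rank, r in enumerate(ranked_ids, start=1):
            if r in relevant_ids:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

    def mean_average_precision(per_query_results):
        """per_query_results: list of (ranked_ids, relevant_ids), one pair per query."""
        return sum(average_precision(r, rel) for r, rel in per_query_results) / len(per_query_results)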