Drew Hudson: Compositionality in Visual Reasoning and Generation

Показать описание

The world we live in is inherently compositional: just like a sentence is built upon phrases and words, a visual scene comprises a collection of interacting objects and entities, which in turn are derived from the sum of their parts. This compositionality plays a critical role in our ability to understand the world and adapt to novel situations. It is considered one of the fundamental building blocks of human intelligence.

How to incorporate such compositionality into AI models? How can we encourage neural networks to develop semantic understanding of our surroundings? And how can we leverage the emerging structured knowledge to improve in downstream tasks such as question answering or image generation? Stanford PhD candidate Drew Hudson presented models for multi-step synthesis of and reasoning over multi-object scenes, described key design principles, and illustrated the benefits they offer in this discussion.