Object-Centric Representation Learning

Theoretical (Analytical):

Practical (Implementation):

Literature Work:


Recently, a growing body of work identifies object-centricicity as a suitable inductive bias for deep representation learning targeted at complex compositional down stream tasks. Learning object-centric representations from data opens up opportunities for robust, generalizable and interpretable machine learning. Structured deep learning models, specifically utilizing object-centric representations, have been identified as a potential tool for next-generation AI. In particular, object-centric representation learning, outputting a discrete set of latent vectors, each describing an individual object, is one tool to convert high-dimensional, perceptual data into a representation suitable for compositional models like Graph Neural Networks or Transformers. In this project, you programm a small test environment for visual object-centric learning, and train neural network models based on it. You are also tasked to write supporting code to run and analyze the results. For the test environment and model architectures, you can draw from existing code.

Problem Statement

Even for modern deep learning architectures, it is a challenge to extract high-level structure and components from unstructured inputs, e.g., images or videos. Uncovering those structures, i.e., individual objects and their respective relations enables many exiting applications. These include reasoning over longer time horizions, abstraction and categorization, as well as simple verbalization by mapping from discovered structures to natural language. For this topic, we explore object discovery and object-centric (deep) representation learning in a small toy problem setting, with the idea of extending it to future use cases.


  • Review current approaches in object-centric representation learning in computer vision
  • Create a small-scale experimental test environment
  • Decide together on a specific (existing) approach for a model architecture, preferebly including a multi-modal component
  • Run chosen approach on the test environment
  • Evaluate training and resulting models
  • Implement a small visualization tool to showcase the resulting model


Good programming skills in Python and Javascript/TypeScript.

Good knowledge of Deep Learning architectures and algorithms, specifically from Computer Vision.

Knowledge of Transformers and/or Graph Networks is an advantage.




  • Scope: Master
  • 3 Month Project, 6 Month Thesis
  • Start: immediately



[1] CLEVRER: COLLISION EVENTS FOR VIDEO REPRESENTATION AND REASONING, Yi K, GanC, Li Y, 2020, International Conference on Learning Representations

[2] MONet: Unsupervised Scene Decomposition and Representation, Burgess C, Metthey L, Watters N, 2019, ArXiV

[3] Attention over learned object embeddings enables complex visual reasoning, Ding D, Santoro A, 2020, CoRR, https://arxiv.org/abs/2012.08508v2