Object-Centric Reinforcement Learning

Theoretical (Analytical):

Practical (Implementation):

Literature Work:


Recently, a growing body of work has identified object-centricity as a suitable inductive bias for deep representation learning aimed at complex, compositional downstream tasks. Learning object-centric representations from data opens up opportunities for robust, generalizable, and interpretable machine learning, and structured deep learning models built on such representations have been proposed as a potential tool for next-generation AI. In particular, object-centric representation learning outputs a discrete set of latent vectors, each describing an individual object, and thereby converts high-dimensional perceptual data into a representation suitable for compositional models such as Graph Neural Networks or Transformers. In this project, your task is to adapt object-centric architectures (e.g., from computer vision) to reinforcement learning problems. You are free to propose your own environment or scenario, but one possibility is to train RL agents in the MineRL environment (based on the game Minecraft). You will implement an architecture, optimize it for use in the RL scenario, and train initial models.
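To make the idea of "a discrete set of latent vectors, each describing an individual object" more concrete, the following is a minimal sketch of the competition mechanism behind Slot Attention [1]. It is deliberately simplified and is not the published architecture: the learned query/key/value projections, layer normalization, GRU update, and residual MLP are all omitted, so it behaves roughly like a soft clustering of input features. The function names and the toy inputs are illustrative assumptions.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified iteration: slots compete for inputs, then are updated.

    Assumption: no learned projections, no GRU, no MLP (unlike the paper).
    """
    dim = len(inputs[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[i][k]: how strongly input i is assigned to slot k.
    # The softmax runs over slots, so slots compete for each input.
    attn = [softmax([scale * dot(x, s) for s in slots]) for x in inputs]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        norm = sum(weights) + eps
        # Each slot becomes the attention-weighted mean of the inputs.
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / norm
            for d in range(dim)
        ])
    return new_slots, attn


def simplified_slot_attention(inputs, num_slots, num_iters=3, seed=0):
    """Randomly initialize slots and refine them for a few iterations."""
    rng = random.Random(seed)
    dim = len(inputs[0])
    slots = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]
    attn = []
    for _ in range(num_iters):
        slots, attn = slot_attention_step(slots, inputs)
    return slots, attn


if __name__ == "__main__":
    # Four toy feature vectors forming two loose groups.
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    slots, attn = simplified_slot_attention(features, num_slots=2)
    print(slots)
```

The resulting slot vectors form a fixed-size set that could then be passed to a Transformer or Graph Neural Network policy; in an actual RL setup the inputs would be features from a convolutional encoder rather than hand-written vectors.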

Problem Statement

Even for modern deep learning architectures, it is a challenge to extract high-level structure and components from unstructured inputs such as images or videos. Uncovering these structures, i.e., individual objects and their relations, enables many exciting applications. These include reasoning over longer time horizons, abstraction and categorization, and simple verbalization by mapping discovered structures to natural language. For this topic, we explore object discovery and object-centric (deep) representation learning in a small toy problem setting, with the idea of extending it to future use cases.


  • Review current approaches in object-centric representation learning in computer vision
  • Create a small-scale experimental test environment
  • Decide together on a specific (existing) approach for a model architecture, preferably including a multi-modal component
  • Run chosen approach on the test environment
  • Evaluate training and resulting models
  • Implement a small visualization tool to showcase the resulting model
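As an illustration of what a small-scale experimental test environment from the task list might look like, here is a hypothetical grid world with one agent and a few static objects. Everything in it is an assumption made for this sketch (the class name `ToyObjectEnv`, the integer cell encoding, the reward rule, and the Gym-like `reset`/`step` interface); it is not tied to MineRL or to any specific library. Because the object positions are known, the environment can also return ground-truth masks for evaluating object discovery.

```python
import random


class ToyObjectEnv:
    """Hypothetical grid world: one agent and a few static objects.

    Cell encoding (an assumption of this sketch):
    0 = background, 1 = agent, 2.. = object ids.
    """

    # Actions: 0 = up, 1 = down, 2 = left, 3 = right
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=6, num_objects=3, seed=0):
        self.size = size
        self.num_objects = num_objects
        self.rng = random.Random(seed)
        self.agent = None
        self.objects = []

    def reset(self):
        """Place the agent and objects on distinct random cells."""
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        picked = self.rng.sample(cells, self.num_objects + 1)
        self.agent = picked[0]
        self.objects = picked[1:]
        return self.observe()

    def observe(self):
        """Return the grid as a list of rows; the agent is drawn on top."""
        grid = [[0] * self.size for _ in range(self.size)]
        for idx, (r, c) in enumerate(self.objects):
            grid[r][c] = idx + 2
        ar, ac = self.agent
        grid[ar][ac] = 1
        return grid

    def step(self, action):
        """Move the agent (clipped at the border); reward 1.0 on the target."""
        dr, dc = self.MOVES[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        done = self.agent == self.objects[0]  # first object is the target
        reward = 1.0 if done else 0.0
        return self.observe(), reward, done

    def object_masks(self):
        """Ground-truth binary mask per entity (agent first), for evaluation.

        Masks can overlap when the agent stands on an object cell.
        """
        masks = []
        for r, c in [self.agent] + self.objects:
            mask = [[0] * self.size for _ in range(self.size)]
            mask[r][c] = 1
            masks.append(mask)
        return masks


if __name__ == "__main__":
    env = ToyObjectEnv(size=5, num_objects=2, seed=1)
    for row in env.reset():
        print(row)
```

Keeping the observation as a plain grid makes it easy to render the scene in a visualization tool and to compare the masks predicted by an object-centric model against `object_masks()`.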


Good programming skills in Python and JavaScript/TypeScript.

Good knowledge of Deep Learning architectures and algorithms, specifically from Computer Vision.

Knowledge of Transformers and/or Graph Networks is an advantage.




  • Scope: Master
  • 3-month project, 6-month thesis
  • Start: immediately



[1] Object-Centric Learning with Slot Attention, Francesco Locatello et al. (https://arxiv.org/abs/2006.15055)

[2] MONet: Unsupervised Scene Decomposition and Representation, Burgess C, Matthey L, Watters N, 2019, arXiv

[3] Attention over learned object embeddings enables complex visual reasoning, Ding D, Santoro A, 2020, CoRR, https://arxiv.org/abs/2012.08508v2

[4] MineRL environment (https://minerl.io/)