AI has advanced rapidly in the last few years. Scaling up Transformers has led to remarkable capabilities in language (e.g., GPT-3, PaLM, Chinchilla), code (e.g., Codex, AlphaCode), and image generation (e.g., DALL-E, Imagen). There are strong indications, and first models, that promise a new era of computing defined by natural language interfaces: most interaction with computers will happen through natural language rather than GUIs, so we will tell our computers directly what we want instead of doing it by hand. While we are getting closer to this goal, open challenges remain, e.g., (1) how machine learning agents can interact with user interfaces, (2) how we as humans can communicate our intentions, and (3) how agents can learn what our preferences and intentions are. You will contribute to this line of research by implementing an interface that agents (e.g., trained with reinforcement learning) can use to interact with web-based user interfaces. Such an interface can serve as the basis for future research on language-based interfaces for different applications.
We will use a small set of websites, or an existing visual analytics web application, as the basis for the implementation. Your job is to implement an interface and a simple agent that can interact with the application, e.g., press buttons, use sliders and other controls, and fill in text boxes. One important task is parsing the website for labels, text, and similar cues that give the agent information about the tasks it can perform. Finally, you will create a simple interface for human natural language input, e.g., a text box or even a speech-to-text interface.
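To make the parsing step more concrete, here is a minimal sketch of how labels and controls could be extracted from a page. It is only an illustration under stated assumptions: the `Control` schema, the `extract_controls` helper, and the sample `PAGE` markup are hypothetical and not part of any existing application. It uses only Python's standard `html.parser`; a real project would probably work on the live DOM through a browser automation tool instead.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Control:
    """One interactive element an agent could act on (hypothetical schema)."""
    kind: str        # "button", "slider", "text", ...
    element_id: str  # value of the id attribute, "" if missing
    label: str       # human-readable text associated with the control


class ControlParser(HTMLParser):
    """Collects buttons and inputs together with their visible labels."""

    def __init__(self):
        super().__init__()
        self.controls = []
        self.labels = {}           # element id -> text of its <label>
        self._label_for = None     # id targeted by the <label> being read
        self._open_button = None   # button whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label_for = attrs.get("for")
        elif tag == "button":
            self._open_button = Control("button", attrs.get("id") or "", "")
            self.controls.append(self._open_button)
        elif tag == "input":
            input_type = attrs.get("type") or "text"
            kind = "slider" if input_type == "range" else input_type
            self.controls.append(Control(kind, attrs.get("id") or "", ""))

    def handle_endtag(self, tag):
        if tag == "label":
            self._label_for = None
        elif tag == "button":
            self._open_button = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._open_button is not None:
            # Text inside <button>...</button> is the button's own label.
            button = self._open_button
            button.label = (button.label + " " + text).strip()
        elif self._label_for:
            self.labels[self._label_for] = text


def extract_controls(html):
    """Return all controls found in the HTML, with labels resolved by id."""
    parser = ControlParser()
    parser.feed(html)
    parser.close()
    for control in parser.controls:
        if not control.label:
            control.label = parser.labels.get(control.element_id, "")
    return parser.controls


# Made-up sample page for illustration only.
PAGE = """
<label for="year">Year</label><input type="range" id="year" min="2000" max="2022">
<label for="query">Search</label><input type="text" id="query">
<button id="apply">Apply filter</button>
"""

controls = extract_controls(PAGE)
for control in controls:
    print(control)
```

The resulting list of controls is the kind of structured description an agent could receive as part of its observation, with the `kind` field indicating which actions (click, set value, type) make sense for each element.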
- Get familiar with the basics of reinforcement learning, especially agent design
- Design an interface for autonomous agents to interact with web applications
- Implement an interface to translate natural language input to instructions
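As a rough illustration of the last task, the sketch below maps a short English command to a structured instruction by matching it against the labels of known controls. Everything in it is an assumption made for the example: the `Instruction` format, the `CONTROLS` table, and the keyword rules are hypothetical, and the rule-based matching is only a stand-in for whatever approach the project ends up using (e.g., a learned model).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    """A single step for the agent (hypothetical format)."""
    action: str                  # "click", "set_value", or "type"
    target: str                  # id of the UI element to act on
    value: Optional[str] = None  # slider value or text to enter, if any


# Hypothetical controls, as they might come out of a page-parsing step:
# element id -> (kind, visible label)
CONTROLS = {
    "apply": ("button", "Apply filter"),
    "year": ("slider", "Year"),
    "query": ("text", "Search"),
}


def translate(command, controls=CONTROLS):
    """Map a short English command to an Instruction, or None if nothing matches."""
    text = command.lower()
    for element_id, (kind, label) in controls.items():
        # Use the first word of the label as the keyword to look for.
        keyword = label.lower().split()[0]
        if not re.search(rf"\b{re.escape(keyword)}\b", text):
            continue
        if kind == "button":
            return Instruction("click", element_id)
        if kind == "slider":
            number = re.search(r"-?\d+(?:\.\d+)?", text)
            if number:
                return Instruction("set_value", element_id, number.group())
        if kind == "text":
            # Text to enter is expected in double quotes.
            quoted = re.search(r'"([^"]*)"', command)
            if quoted:
                return Instruction("type", element_id, quoted.group(1))
    return None


for example in ["Please apply the filter",
                "Set the year to 2015",
                'Search for "climate data"']:
    print(example, "->", translate(example))
```

Keeping the instruction format independent of how it is produced means the same agent interface can later be driven by a text box, a speech-to-text front end, or a learned language model.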
Knowledge of Reinforcement Learning is a big plus.
- Scope: Bachelor/Master
- 3-month project, 3-month thesis, or 6-month thesis
- Start: immediately
- Adept ACT-1 model, https://www.adept.ai/act
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, https://openai.com/blog/vpt/ (2022)
- Lotse: A Practical Framework for Guidance in Visual Analytics, Fabian Sperrle, Davide Ceneda, Mennatallah El-Assady, https://arxiv.org/abs/2208.04434 (2022)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., https://say-can.github.io/ (2022)