Take a look at this screenshot of Visual Basic 5.0, released in 1997:
It has autocomplete, a file picker, search, syntax highlighting, and a console.
Compare it to this screenshot of Visual Studio Code from 2024:
It has autocomplete, a file picker, syntax highlighting, a console, and an AI coding assistant.
More than just some grey text, coding assistants are the first major UI innovation in 27 years. Soon, UIs will look radically different.
There are now two user interfaces. One, the H-UI, is presented to the human. It takes the underlying state of the machine, the files and I/O, and converts it into images that the human can understand and interact with. It has the new challenge of also displaying the state of the AI, which is hugely complex and deeply mysterious, in a way that the human can similarly understand and interact with.
But the AI itself is also a user and needs an interface! The A-UI has to convert the state of the machine into something that the AI can competently and cheaply understand.
# Path: hello.py
# Compare this snippet from module1.py:
# def hello_world():
#     return 'Hello, world!'
#
# def another_function(a, b):
#     return a + b
#
# def my_subroutine():
#     print("We all live in a yellow subroutine")
#
# foo = "foo"
import module1
print(module1.hello_world())
module1.my_subroutine()
module1.
This is the A-UI that GitHub Copilot receives for this state.
It is a prompt, and the AI has exactly one action it can perform: continuing the text. The continuation is shown to the user as the AI suggestion. The other file I have open, module1.py, is presented as a comment. Human users understand the window metaphor, but LLMs do better when the interface is presented as a linear stream of text, because that is what they were trained on.
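To make the flattening concrete, here is a minimal sketch of how an editor's state could be turned into a prompt shaped like the one above. It is not Copilot's actual code: the OpenFile class and the as_comment_block and build_prompt functions are hypothetical names invented for illustration, and only the "# Path:" and "# Compare this snippet from" headers are taken from the prompt shown earlier.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    """Hypothetical stand-in for one editor tab."""
    path: str
    text: str


def as_comment_block(file: OpenFile) -> str:
    """Render a neighbouring file as Python comments, one comment per source line."""
    header = f"# Compare this snippet from {file.path}:"
    body = [("# " + line).rstrip() for line in file.text.splitlines()]
    return "\n".join([header, *body])


def build_prompt(current: OpenFile, neighbours: list[OpenFile]) -> str:
    """Flatten the editor state into one linear text prompt for the model to continue."""
    parts = [f"# Path: {current.path}"]
    parts += [as_comment_block(f) for f in neighbours]
    parts.append(current.text)  # text up to the cursor; the model continues from here
    return "\n".join(parts)


module1 = OpenFile("module1.py", "def hello_world():\n    return 'Hello, world!'\n")
hello = OpenFile("hello.py", "import module1\nprint(module1.hello_world())\nmodule1.")
prompt = build_prompt(hello, [module1])
print(prompt)
```

The window layout disappears entirely: the second tab becomes a comment block, and the cursor position becomes the end of the string.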
For Copilot, the A-UI is static. But in Snail, the A-UI can be manipulated by the agent without the user’s involvement. Here, the agent investigates the system with uname -a and ls before settling on its final command suggestion. Dynamic A-UIs are powerful because they let the agent and user operate independently. However, great care must be taken to prevent the agent from interrupting or harming the user’s work. In Snail, the agent is only allowed to run terminal commands that have no side effects.
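One simple way to approximate a "no side effects" rule is an allowlist check before the agent's command is executed. The sketch below is an assumption about how such a gate could look, not a description of Snail's implementation: the READ_ONLY_PROGRAMS set, the FORBIDDEN_CHARS set, and the is_side_effect_free function are all hypothetical, and a real system would need something stricter (for example a sandbox) rather than string inspection alone.

```python
import shlex

# Hypothetical allowlist: programs that only report state when given plain arguments.
READ_ONLY_PROGRAMS = {"ls", "pwd", "uname", "whoami", "id"}

# Shell syntax that could chain, redirect, or substitute commands.
FORBIDDEN_CHARS = set(";|&><`$(){}\n")


def is_side_effect_free(command: str) -> bool:
    """Conservatively decide whether the agent may run `command` on its own."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    # Reject empty commands and any program that is not on the allowlist.
    return bool(argv) and argv[0] in READ_ONLY_PROGRAMS


for candidate in ["uname -a", "ls", "rm -rf build", "ls > out.txt"]:
    print(candidate, "->", is_side_effect_free(candidate))
```

The check errs on the side of refusing: anything it does not recognise is treated as having side effects and would need the user's approval.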
Ideally AI agents have UIs designed specifically for them. But currently the vast majority of the world's UIs are designed for humans, so it is useful to be able to convert H-UIs into A-UIs.
Some language models are trained directly on video streams paired with keyboard/mouse input. These multimodal agents can operate on almost any UI designed for humans, but they lose a lot of competency compared to a language model operating on just text. They are also very expensive to run, because processing HD video takes far more compute than processing text.
The JoelBrowser renders web pages for an AI agent. It offers raw pixels for models that want it, but also offers text only views and a mixed mode where low resolution video is combined with textual information.
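As an illustration of what a text-only view might involve, the sketch below reduces an HTML page to its visible text plus a numbered list of links, using only Python's standard html.parser module. The output format, the TextView class, and the render_text_view function are assumptions made for this example; they are not JoelBrowser's actual formats.

```python
from html.parser import HTMLParser


class TextView(HTMLParser):
    """Collect visible text and numbered links, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self.links: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.chunks.append(f"[{len(self.links)}]")  # marker the agent can cite

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def render_text_view(html: str) -> str:
    """Return the page as one line of text followed by one line per link."""
    parser = TextView()
    parser.feed(html)
    parser.close()
    body = " ".join(parser.chunks)
    refs = [f"[{i}] {url}" for i, url in enumerate(parser.links, start=1)]
    return "\n".join([body, *refs])


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<p>Bid on <a href="/yacht">this yacht</a> today.</p>'
    "<script>track()</script></body></html>"
)
view = render_text_view(page)
print(view)
```

A view like this costs a handful of tokens where a screenshot would cost thousands of pixels, at the price of discarding layout.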
Human users expect that AI assistants will have read-only access to their data. JoelBrowser can read cookies and operate as the user, so that AI agents can log in as their human partner. But it blocks all secondary web traffic, so the AI cannot bid on a yacht on eBay without the human's permission.
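The post does not spell out what counts as "secondary web traffic". One possible reading, and it is only an assumption, is any request other than a top-level page load with a read-only HTTP method. Under that reading, a gate could look roughly like the following; the Request class and agent_may_send function are hypothetical names, not JoelBrowser's API.

```python
from dataclasses import dataclass

# HTTP methods defined as safe (read-only) by the HTTP specification.
SAFE_METHODS = {"GET", "HEAD"}


@dataclass(frozen=True)
class Request:
    """Hypothetical description of a request the agent wants the browser to send."""
    method: str
    url: str
    top_level_navigation: bool


def agent_may_send(request: Request, human_approved: bool = False) -> bool:
    """Allow read-only page loads; everything else needs the human's permission."""
    is_primary = request.top_level_navigation and request.method.upper() in SAFE_METHODS
    return is_primary or human_approved


view_listing = Request("GET", "https://example.com/yacht", top_level_navigation=True)
place_bid = Request("POST", "https://example.com/bid", top_level_navigation=False)

print(agent_may_send(view_listing))                   # read-only page load
print(agent_may_send(place_bid))                      # blocked until a human approves
print(agent_may_send(place_bid, human_approved=True))
```

Note that a GET request is only read-only by convention, so a gate like this narrows what the agent can do rather than guaranteeing it.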
Here is a dataset of Common Crawl rendered in the JoelBrowser formats. If you have more specific requirements for your model, or just want to chat, please get in touch.
Language models are not smart enough to operate unsupervised for extended periods of time. Ask Claude to write a JSON parser, and it does an excellent job. Ask it to write a web browser, and it can't even get started.
This isn't just a problem of context sizes and model weights. A human programmer would be daunted by writing a web browser from scratch, and it would be impossible to do it without a good set of developer tools.
I created an IDE for LLMs. This A-UI lets the AI operate