Creating new worlds

Most people can mentally map spaces. Say you're heading to the grocery store: you picture the route, the milk aisle, the cereal aisle, the cash register, the exit. Whether AI models possess a similar understanding of their environment is debated, and the debate often comes down to whether reading about the world is the same as experiencing it.

Essentially, current large language models (LLMs) are trained on gobs of data (similar to us reading a great deal). When prompted, they respond with an answer based on what they have "read." For example, if you ask how to get from your hotel to the Eiffel Tower, the model will produce a response based on its training data (books, social media posts, etc.), but, and this is debated, it doesn't visualize the journey: walking down streets, seeing vendors, or stopping at a coffee shop. If you ask for directions to a less-documented location, like your friend's house, you're likely to get little relevant guidance.

This is where world models come in. World models are systems that simulate the physical world, capturing the same dynamics and rules found in reality rather than simply reproducing what they've "read." In effect, this gives AI the ability to visualize its environment the way we do. That kind of understanding is likely necessary for computer systems to act reliably in the real world: you wouldn't want a robot fetching your milk and cereal if it had only a vague sense of where they were; it might grab the wrong items, damage property, or cause a range of other negative outcomes. World modeling would also support self-driving cars, which require extensive data, by letting these systems generate additional synthetic data for training.

The idea that we carry a small-scale model of the world in our heads dates back decades and has been researched for quite some time. When LLMs took the world by storm, world models moved out of the spotlight. Now that LLM improvements have become marginal, far from the earlier leaps, world models are advancing significantly and moving back into the limelight.

In January, Google granted access to Project Genie (currently US-only). I highly recommend watching the video in their blog post, which gives a glimpse of the tool's capabilities. It certainly isn't perfect: the generations might not be that lifelike, sessions are limited to 60 seconds, and the character you control might not act the way you want. But it is jaw-dropping to see a world you can move through created from a simple prompt. And Google is far from alone; Meta and World Labs are among the many players building these models.

Without getting too technical, there are three main approaches to developing a world model.

  1. One approach uses video game-style generators, as in Project Genie. These models fill in missing information based on the given data and adhere to the provided rules. For example, given a photo of a maze, the model can trace a route to the end.

  2. Another method generates fully realized 3D environments from the start, allowing multiple users to explore and interact over time. World Labs is a leader in this area, creating persistent worlds beyond simple video representations.

  3. A third approach projects further into the future. Current world models focus mostly on what is immediately next; the Joint-Embedding Predictive Architecture (JEPA) instead aims to quickly simulate a wider range of real-world dynamics (e.g., how traffic will affect your commute).
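To make the JEPA idea slightly more concrete, here is a minimal, purely illustrative sketch in plain Python. The core trick: rather than predicting every pixel of a future frame, the model encodes observations into an abstract embedding space and predicts the *next embedding*, so it can ignore irrelevant detail. Every name, dimension, and the random linear "networks" below are made-up toys; real JEPA systems use large neural encoders trained on images and video.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions for illustration, not real model dimensions).
DIM_OBS, DIM_EMB = 4, 3

# Tiny random linear "encoder" and "predictor" weight matrices.
W_enc = [[random.gauss(0, 0.1) for _ in range(DIM_EMB)] for _ in range(DIM_OBS)]
W_pred = [[random.gauss(0, 0.1) for _ in range(DIM_EMB)] for _ in range(DIM_EMB)]

def matvec(W, v):
    """Multiply vector v by weight matrix W (rows = len(v))."""
    return [sum(W[i][j] * v[i] for i in range(len(v))) for j in range(len(W[0]))]

def encode(x):
    """Map a raw observation into a compact embedding."""
    return [math.tanh(v) for v in matvec(W_enc, x)]

def predict(z):
    """Guess the embedding of the NEXT observation from the current one."""
    return matvec(W_pred, z)

def embedding_loss(x_now, x_next):
    """JEPA-style training signal: compare predictions in embedding
    space, not pixel space."""
    z_pred = predict(encode(x_now))
    z_true = encode(x_next)
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_true)) / DIM_EMB
```

During training, a real system would adjust the encoder and predictor to drive this embedding loss down across many observed transitions, which is what lets it "project the future" without rendering it in full detail.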

We are still in the very early stages of developing these systems, but the future is promising. There could be standalone use cases for these models, or they could be combined with other systems (e.g., LLMs) to create a multimodal model.

There isn't much for you to do in this corner of AI just yet. Simply being aware of the limitations of LLMs (for example, think twice before asking these tools for directions) and of the potential of world models is a good first step.

Take care,

Emanuel
