
Audit: World Models - Lorin Achey and Carson Kohlbrenner #48

Open
cKohl10 wants to merge 40 commits into main from audit/cKohl10-lorinachey-world-models

Conversation

Collaborator

@cKohl10 cKohl10 commented Feb 4, 2026

This technical audit explores the current state of the art for world models: their architecture, robotics use cases, and limitations. It focuses on GAIA-1, Genie, TesserAct, and Cosmos as case studies.


github-actions bot commented Feb 4, 2026

🚀 Preview Deployed

Your preview is ready for review!

🔗 Preview URL: https://arpg.github.io/vla-foundations/staging/pulls/48/textbook/audits/staging/Architecture-Fig2-TesserAct.png/

Review Checklist

  • LaTeX equations render correctly
  • All sections are complete per the template
  • References are formatted properly
  • Figures/diagrams display correctly

Next Steps

  1. Review your rendered content using the preview link above
  2. Tag @crheckman when ready for instructor review
  3. Push updates to auto-refresh the preview

This preview will be removed when the PR is closed.

cKohl10 and others added 20 commits February 3, 2026 23:56
Collaborator Author

cKohl10 commented Feb 10, 2026

@crheckman I believe all the final changes have been made and are ready for your review

Collaborator

@crheckman crheckman left a comment

first half of reading period.

Classical simulators such as Isaac Sim \[1\] and MuJoCo \[2\] capture the physical dynamics necessary for training embodied agents; however, the hardcoded dynamics in such simulators are impractical for large-scale data generation of nuanced physical phenomena and realistic rendering.
World models (also referred to as World Foundation Models or WFMs) offer an alternative, data-driven approach to simulation and future state prediction that can capture more nuanced physical phenomena and render realistic video/image outputs.
World models are trained to capture the underlying spatial and temporal dynamics in images and video to predict future states of the environment.
In this document, we will look at four prevalent world models: GAIA-1 \[3\], Genie \[4\], TesserAct \[5\], and Cosmos \[6\].


## Architecture

Each world model analyzed in this document fundamentally learns to predict the spatio-temporal dynamics of static frames.
Collaborator

None of them consume video as context?

Each model follows the encoder-decoder formulation where an encoder $\mathcal{E}$ ingests input frames $x$ from time $t=0:T$ and encodes them into latent tokens $z_{0:T}$, a dynamics model $\text{DYN}$ predicts the next latent tokens $z_{T+1:T+K}$, and a decoder $\mathcal{D}$ reconstructs the frames at time $t>T$.
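
To make this shared interface concrete, here is a minimal PyTorch-style sketch of the $\mathcal{E}$ / DYN / $\mathcal{D}$ rollout; module internals, tensor shapes, and the `rollout` helper are illustrative placeholders rather than any single paper's implementation:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Generic encoder / dynamics / decoder interface shared by the four models."""
    def __init__(self, enc: nn.Module, dyn: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc, self.dyn, self.dec = enc, dyn, dec

    @torch.no_grad()
    def rollout(self, frames: torch.Tensor, k: int) -> torch.Tensor:
        """frames: (B, T, C, H, W) context clip; returns K predicted future frames."""
        z = self.enc(frames)                   # latent tokens z_{0:T}, e.g. (B, T, N, D)
        for _ in range(k):
            z_next = self.dyn(z)[:, -1:]       # predict z_{T+i} from the latent history
            z = torch.cat([z, z_next], dim=1)  # re-condition on the growing latent sequence
        return self.dec(z[:, -k:])             # decode only the predicted latents to pixels
```

The audited models differ mainly in how each stage is realized (discrete vs. continuous tokens, autoregressive vs. diffusion dynamics, and the choice of decoder).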
Collaborator

Are none of them fully autoregressive? (ingest context, create latent vector, and autoregressively decode subsequent frames)? If not, and assuming such an architecture has been tried by anyone, seems like there may be an explanation (computational savings, stability, ...).


### Features

<table>
Collaborator

I think there's a way to render tables in Markdown with less ... HTML. Consider revising for reviewability's sake.

Collaborator

You should be able to hit this repo (not just the mdx file) with your favorite AI code assistant and help out in cleaning some of this up.

<td><strong>Cosmos</strong></td>
<td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
<td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
<td>10,000 H100 GPUs (for 3 months)</td>
Collaborator

😱

Were they really training for this long?

Contributor

Llama's 405B model took 2 months to train on 16,000 H100s too.

</tr>
<tr>
<td><strong>TesserAct</strong></td>
<td><em>Not specified in sources</em></td>
Collaborator

The model is built on CogVideoX-5B (https://github.com/UMass-Embodied-AGI/TesserAct/blob/main/doc/usage.md); 30 GB of weights.


Tokenization is a critical component for world models as it compresses high-dimensional image data into a lower-dimensional latent space that the world model can efficiently reason over.
The naive approach of sectioning images into patches and flattening them into vectors is often insufficient for capturing the complex spatial and temporal relationships in image data efficiently enough for practical use of a world model.
State-of-the-art world models instead use a variety of **discrete** and **continuous** tokenization approaches as follows:
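
As a minimal sketch of the distinction (names and shapes are illustrative; a learned `codebook` is assumed for the discrete case and a VAE-style Gaussian latent for the continuous case):

```python
import torch

def discrete_tokens(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Vector quantization: each patch feature is replaced by the index of its
    nearest codebook entry, so a frame becomes a grid of integers (as in VQ-VAEs)."""
    dists = torch.cdist(feats, codebook)   # feats: (N, D), codebook: (K, D) -> (N, K)
    return dists.argmin(dim=-1)            # (N,) indices in {0, ..., K-1}

def continuous_tokens(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Continuous tokens: each patch stays a real-valued vector, here sampled from a
    Gaussian posterior as in a VAE, and is consumed directly by the dynamics model."""
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()
```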
Collaborator

Define a "continuous token."

<td><strong>GAIA-1</strong></td>
<td><strong>Multimodal understanding</strong> and disentanglement of static and dynamic driving elements like pedestrians and road layouts.</td>
<td>Potential for <strong>sampling errors</strong> (loops or OOD artifacts) if autoregressive sampling strategies are not carefully tuned.</td>
<td>It uses a <strong>unified representation</strong> for video, text, and actions, but relies on a diffusion decoder to correct temporal inconsistencies in its latent predictions.</td>
Collaborator

briefly expand on why this might be an issue (i.e. sampling errors)

<td><strong>Cosmos</strong></td>
<td><strong>14 Billion</strong> (Cosmos-Predict1-14B variant)</td>
<td><strong>~20 Million hours</strong> of raw video ($10^8$ video clips)</td>
<td>10,000 H100 GPUs (for 3 months)</td>
Contributor

Might be useful to mention how much compute is required to actually run these, if that's mentioned anywhere in the paper.

**Information Decay.** The tokenizer compresses 3.5M bits to 7,488 bits (470×).
Sub-pixel depth gradients, high-frequency textures, precise object boundaries, and small/distant objects may fall below tokenization resolution.
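
The 470× figure is consistent with, for example, a 288×512 RGB frame at 8 bits per channel compressed to 576 tokens drawn from an 8,192-entry codebook (our assumption about the underlying configuration, not stated in this section):

$$
\frac{288 \times 512 \times 3 \times 8 \ \text{bits}}{576 \times \log_2(8192) \ \text{bits}} = \frac{3{,}538{,}944}{7{,}488} \approx 473
$$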

**The Semantic-Motor Gap.** GAIA-1 outputs video frames, not control commands.
Contributor

Was their target in the paper to use this world model for training purposes? Or was it to actually control vehicles in real time? If they are addressing it as a limitation, I'm wondering what their original intentions were..


**If this paper were a technical proposal at Zoox/Tesla, would I sign off?**

**For Production: CONDITIONAL NO**
Contributor

How would this be used in production? To generate possible future sequences in real time (maybe for MPC)? Or would it be used for RL offline for finetuning a policy? The feasibility might be different depending on the use case.

---

# Technical Paper Audits: World Models

Collaborator

I would appreciate a brief history of world models dating back to the 80s/90s. This paper gives a good introduction: World Models

This dispels the idea that world models are a new thing.

<tr>
<td><strong>Genie</strong></td>
<td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
<td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
Contributor

It looks like most of these models have much lower inference speeds compared to a standard simulator. Is there another way to get around this lower speed, like using larger batches in parallel, or is the seemingly better data just worth the speed trade-off?

Collaborator

@crheckman crheckman left a comment

second half of reading period

<tr>
<td><strong>Genie</strong></td>
<td><strong>Unsupervised learning</strong> of interactive environments from massive, action-free Internet video corpora.</td>
<td>Limited to <strong>16 frames of memory</strong> and an inference speed of approximately <strong>1 FPS</strong>.</td>
Collaborator

16 frames of memory is a disaster. Is this not somewhere they can make use of mRoPE and long-context training? Is real-time inference the only bottleneck?

<tr>
<td><strong>Cosmos</strong></td>
<td>Providing a <strong>highly scalable platform</strong> for Physical AI with state-of-the-art reconstruction quality.</td>
<td>Models still struggle with perfect <strong>physics adherence</strong> and object permanence in certain edge cases.</td>
Collaborator

physics adherence -> no models can actually adhere to physics, so if it's stated here, are the hallucinations/violations obvious?


* **Data and observability limits**: embodied, contact-rich interactions are underrepresented in large-scale datasets, and video-only observations cannot capture hidden state (e.g., forces, friction), limiting physics-faithful rollouts \[11\].
* **Physical consistency failures**: long-horizon generations can violate object permanence and contact dynamics, making some models unreliable as safety-critical simulators \[6\].
* **Weak closed-loop evidence**: GAIA-1 is a driving-focused generator rather than a deployable controller and is not evaluated in closed-loop autonomy \[3\].
Collaborator

I think you need to mention something about computational efficiency too.

We need to address the computational feasibility of the WFM in the control loop. If we run the WFM and VLA in parallel for predictive control, the inference latency of current generative architectures (diffusion/autoregressive) makes real-time operation impossible.

Furthermore, if we quantize or reduce sampling steps to force real-time performance, we risk washing out the variance in the simulation. This creates a 'mean-seeking' world model that fails to represent the dangerous edge cases our VLA actually needs to plan against. We will also end up with compounding simulation drift!

They can also serve as a "pre-trained" initialization to address **data scarcity** in real-world robotics.
* **Safe Policy Training:** By pairing a WFM with a reward model, agents can gain proficiency through **reinforcement learning** in a simulated environment that faithfully adheres to physical laws.
* **Planning and Model-Predictive Control (MPC):** Robots can use world models to simulate multiple potential future states based on different action sequences, executing only the path that maximizes the predicted reward (a minimal planner sketch follows this list).
* **Synthetic Data Generation for Sim2Real:** WFMs can generate massive amounts of synthetic video data, including metadata like **depth or semantic maps**, to bridge the gap between simulation and real-world deployment.
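
As a concrete illustration of the planning/MPC use case above, here is a minimal random-shooting planner over a learned world model; `world_model.imagine` and `reward_model` are hypothetical stand-ins rather than APIs from any of the audited systems:

```python
import torch

def mpc_action(world_model, reward_model, obs, horizon=8, n_samples=64, action_dim=7):
    """Sample candidate action sequences, roll them out inside the learned world model,
    score the imagined futures with a reward model, and execute only the first action
    of the best sequence (receding-horizon control)."""
    actions = torch.randn(n_samples, horizon, action_dim)   # candidate action sequences
    futures = world_model.imagine(obs, actions)             # imagined rollouts, one per candidate
    returns = reward_model(futures).sum(dim=1)              # (n_samples,) total predicted reward
    best = returns.argmax()
    return actions[best, 0]                                 # act, then replan at the next step
```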
Collaborator

I think you must also mention something about the practical impossibility of classically modeling phenomena like non-specular reflection, radiative diffusion, granular media, and other physical phenomena that these models can pretty faithfully reconstruct at scale. This means we can "observe" edge case phenomena at a much higher frequency than we would casually encounter them in the world, and build models that understand them using these newly generated datasets.

Contributor

@krusnim krusnim left a comment

Nice audit. I focused on the section on Genie for my comments.


## **3. Data & Scaling**
Genie follows the scaling laws typical of Large Language Models (LLMs).
* **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
Contributor

I don't understand why Genie used this platformer data. (Are platformers specifically what Genie is "for?") They boast that this "generalizes beyond gaming to robotic manipulation," but that seems very suspect to me, unless they threw out the platformer data entirely and just used RT-1's dataset for that experiment. In which case, why lead with the platformer data?

Contributor

Reading the paper I see now that the RT-1 version is a separate model. So the generality they're boasting is of the approach, not of a singular model - might want to make that slightly clearer.

## **5. Critical Synthesis & Sign-Off**

### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
Contributor

Yeah, I struggle to understand how the paper gets around this limitation. An action like [jump] means vastly different things for a platformer and for a robot.

The model treats interactive environment generation as a **next-token prediction task**, where future states are conditioned on inferred latent actions.

### **2.2 Video Tokenization**
The **ST-ViViT tokenizer** (200M parameters) utilizes a **VQ-VAE** \[5\] with ST-transformer blocks in both the encoder and decoder.
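
For intuition, here is a simplified sketch of the factorized spatial/temporal attention that an ST-transformer block performs; normalization placement, causal masking over time, and dimensions are all simplified relative to the actual Genie block:

```python
import torch.nn as nn

class STBlock(nn.Module):
    """Attention is split into a spatial pass (tokens within a frame) and a temporal pass
    (the same patch position across frames), which is much cheaper than full attention
    over all T*N spatio-temporal tokens. Causal masking over time is omitted here."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (B, T, N, D) -- T frames of N patch tokens
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)              # attend over patches within each frame
        s = s + self.spatial(s, s, s)[0]
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = t + self.temporal(t, t, t)[0]       # attend over time at each patch position
        x = t.reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + self.ff(x)
```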
Contributor

Maybe worth clarifying that (from my understanding) they actually use two VQ-VAEs: one for video tokenization and one for action tokenization.


## **3. Data & Scaling**
Genie follows the scaling laws typical of Large Language Models (LLMs).
* **Dataset (Platformers)**: Constructed by filtering 55M clips down to a high-quality "Curated" set of **6.8M clips (30,000 hours)**. Filtering distractor items like menu screens or streamer faces was found to be more beneficial than raw data quantity.
Contributor

Also, would like to know how they performed dataset filtering, if they mentioned it. Platformer video seems pretty recognizable in comparison to other content so it seems there are some tricks they could use.


### **4.3 The Video-Only Assumption**
The fundamental technical thesis of Genie is that **ground-truth action labels are unnecessary for learning world models**.
By discarding the LAM encoder at inference and allowing a user to index the learned VQ codebook, Genie proves that internet-scale video provides enough causal structure to ground an agent's "understanding" of a world.
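
A minimal sketch of what this looks like at inference time; `dynamics`, the token shapes, and `action_codebook` are hypothetical stand-ins, the point being that the LAM encoder is dropped and the user indexes the learned discrete action codes directly:

```python
import torch

def genie_style_step(dynamics, video_tokens, action_codebook, user_action_id: int):
    """The user picks one of the small set of discrete latent actions learned during
    training (e.g. code 3), and the dynamics model predicts the next frame's tokens
    conditioned on that code -- no ground-truth action labels are ever needed."""
    a = action_codebook[user_action_id]             # (D,) embedding of the chosen latent action
    a = a.expand(video_tokens.shape[0], 1, -1)      # broadcast to the batch: (B, 1, D)
    return dynamics(video_tokens, a)                # predicted tokens for frame t+1
```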
Contributor

Without action data, how does Genie differentiate between the agent and the (rest of the) environment?


### 2.2 Image Tokenizer (0.3B parameters)

**Architecture**: Fully convolutional 2D U-Net encoder-decoder with vector quantization
Collaborator

Why did they opt to use a U-net instead of a newer architecture (e.g. Transformer)?

## **5. Critical Synthesis & Sign-Off**

### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
Contributor

In your presentation, you mentioned that the model only had 8 latent actions. I'm curious if there's a constraint here where the number of actions that works for 2D gaming is inherently not enough to transition into 3D robotics (even though this is a deliberate design choice to enable fully unsupervised training!).

Cosmos emphasizes scalable generation but lacks demonstrated closed-loop robotics performance \[6\].
TesserAct reports downstream gains on manipulation benchmarks (e.g., RLBench) but remains compute-heavy and largely open-loop, leaving real-time/reactive control unresolved \[5\].
Genie learns interactive latent actions from web video but is not validated as a robotics world model for contact-rich control \[4\].

Contributor

Before hopping into the individual papers, I think it would be beneficial to add a short synthesis section that contrasts the papers directly. This would help the reader have larger context while reading each individual paper audit.

Most fundamentally, these models treat video as a sequence of images rather than as the evolution of a world state, lacking the ability to reason about object permanence, velocities, and interactions over time.
Imagen Video can generate a compelling video of "a car driving through Tokyo," but it cannot answer the safety-critical question "what happens if that car brakes suddenly?"

**GAIA-1's Proposed Solution.** GAIA-1 addresses this dual challenge by using two specialized components: a world model that reasons about high-level scene components and dynamics (answering "what happens next?"), and a video diffusion decoder that translates these latent predictions into high-quality pixel-space video (answering "what does it look like?").
Contributor

Maybe add a concise description of how here? That will help engage the reader sooner and connect past research to current innovation.

GAIA-1 introduces a generative world model for autonomous driving that combines autoregressive sequence modeling with video diffusion decoding to generate controllable driving scenarios.
The model accepts video, text, and action inputs, encoding them into discrete tokens and predicting future states autoregressively.
GAIA-1 demonstrates the ability to generate coherent scenes with realistic object interactions that were not explicitly provided in the training data.
However, the model does not run in real-time and lacks closed-loop evaluation, making its utility for actual autonomous driving control questionable.
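
For intuition, a sketch of how the three modalities can be interleaved into one token sequence for next-token prediction; the exact per-timestep ordering and any special separator tokens GAIA-1 uses may differ from this:

```python
def interleave_tokens(text_tokens, image_tokens, action_tokens):
    """Build a single flat sequence by concatenating, per timestep t, the text tokens c_t,
    the discrete image tokens z_t, and the action tokens a_t. The world model is then
    trained to predict token i+1 from tokens <= i over this unified sequence."""
    sequence = []
    for c_t, z_t, a_t in zip(text_tokens, image_tokens, action_tokens):
        sequence.extend(c_t)   # conditioning text for timestep t
        sequence.extend(z_t)   # image tokens for timestep t
        sequence.extend(a_t)   # action tokens for timestep t
    return sequence
```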
Contributor

What is closed-loop evaluation? What is it aiming to quantify and how?

**Mathematical Formulation**:

For input image $x_t$, the encoder produces discrete tokens:
$$
z_t = \mathcal{E}(x_t) = \left(z_t^1, \ldots, z_t^n\right), \qquad z_t^i \in \{1, \ldots, K\}
$$
Contributor

Not rendering in my browser. Double check format.


**Dataset**: 4,700 hours at 25 Hz (~420M images) from proprietary London driving data, with 400 hours held out for validation.
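
The ~420M figure follows directly from the stated duration and frame rate:

$$
4{,}700 \ \text{h} \times 3{,}600 \ \tfrac{\text{s}}{\text{h}} \times 25 \ \tfrac{\text{frames}}{\text{s}} = 4.23 \times 10^{8} \ \text{frames} \approx 420\text{M images}
$$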
Contributor

You cover that this training hampers generalization. Did they do any testing of this model on non-London driving conditions? After reading all these papers, do you think it would be possible to adapt this training to work on non-London datasets?


**RGB-DN vs. Full 3D Representation.** Point clouds, meshes, or NeRFs would provide more complete 3D information, but are computationally expensive to generate and lack mature pretrained models.
RGB-DN is a middle ground: richer than 2D pixels, cheaper than full 3D, and compatible with video diffusion architectures.
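
To make the representation concrete, a small sketch that stacks RGB, depth, and surface normals into one RGB-DN frame and back-projects the depth channel into a point cloud with a pinhole camera model; the channel layout, intrinsics, and function names are our own illustration, not the TesserAct code:

```python
import numpy as np

def rgbdn_frame(rgb, depth, normals):
    """Stack RGB (3), depth (1), and surface normals (3) into a 7-channel frame --
    the 'middle ground' between 2D pixels and a full 3D representation."""
    return np.concatenate([rgb, depth[..., None], normals], axis=-1)   # (H, W, 7)

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a predicted depth map through a pinhole model to recover a
    coarse point cloud from an RGB-DN rollout."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)             # (H*W, 3)
```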

Contributor

I think it would be cool to add a graphic here from the paper. I'd love to see this image -> extract depth information -> produce pointcloud in some visual fashion to see how effective it is.

<tbody>
<tr>
<td><strong>Data Curation</strong></td>
<td>The video curation pipeline transforms raw video into high-quality training data through a five-step process: <strong>splitting</strong> videos into shots, <strong>filtering</strong> for rich dynamics, <strong>annotating</strong> via VLMs, performing <strong>semantic deduplication</strong>, and <strong>sharding</strong> clips for model consumption.</td>
Contributor

Not sure how best to add it, but consider including an explanation of what semantic deduplication and sharding are.

## **5. Critical Synthesis & Sign-Off**

### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
Contributor

What is the significance of this finding?


### **5.1 Load-Bearing Assumptions**
* **Semantic Consistency**: The model assumes that a learned latent action (e.g., "Latent Action 1") will consistently map to the same semantic behavior (e.g., "Jump") across vastly different visual textures—an assumption that holds empirically across game genres and robotic scenes.
* **Intuitive Physics**: It assumes that the visual patterns in 2D platformers contain enough "common sense" to simulate complex features like **parallax and 3D depth**, which the 11B model successfully emulates.
Contributor

Does this mean the depth information isn't required for a quality simulation product? What do you think about this statement? Do you agree or disagree?

Contributor

The opening of the audit does a comparison of the individual papers. Then come the massive paper audits with all the good stuff. When I started reading the introduction to this audit, I didn't realize the significance of the intro in terms of understanding what was coming next. Not sure how to motivate this to the reader in the introduction more clearly, but I think it would be valuable.
