Position Paper · v0.4

The Form and the Shadow.

A substrate-surface paradigm for visual world models — and the four channels that get to write to the world.

Authors: First Intuition Research Team
Version: 0.4 · May 2026
Reading time: 8 min
Status: Living document

Pure generative video hallucinates worlds that don't exist. They look right for a few seconds and then drift — light gets brighter, characters change clothes, objects fall through floors. That's because the generator is the only thing remembering what just happened, and it has better things to do.

This paper sets out the paradigm we think the field should adopt. A controllable world model is physics-based, 3D-driven, and channel-committed: a single shared substrate S holds the geometry, the agents, and the history; a generative surface renders what an observer sees; and a small learned operator κ decides what the surface is allowed to write back as real.

01 · Form and shadow

A generator that owns the world tends to forget it. We separate what is from what appears: the substrate S is the source of truth — the form — and the rendered frame is its shadow. The renderer reads from S; the channels write to it through κ. Forgetting becomes an explicit policy decision, not a side effect of model capacity.

The generator proposes. κ decides. The substrate remembers.

In practice a frame is not predicted end-to-end. It's rendered from the substrate. When you walk through a scene, the next frame draws from the same persistent state — so the lantern that was lit a second ago is still lit, the chair you moved is still where you left it, and the character you talked to remembers what you said.
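
The idea that every frame reads from one persistent state can be sketched in a few lines. This is a toy illustration, not the architecture from the paper: the `Substrate` class, its `commit` method, and the `render` function are hypothetical names, and a flat dictionary stands in for the real scene representation.

```python
from dataclasses import dataclass, field


@dataclass
class Substrate:
    """Hypothetical stand-in for S: a flat dict of named state values."""
    state: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def commit(self, key, value):
        # Record (key, old value, new value) so history lives in the substrate.
        self.history.append((key, self.state.get(key), value))
        self.state[key] = value


def render(substrate, camera):
    """Toy 'surface': a frame is a read-only copy of what the substrate holds."""
    return {"camera": camera, "visible": dict(substrate.state)}


S = Substrate()
S.commit("lantern", "lit")
S.commit("chair_position", (2.0, 0.0, 1.5))

# Walk through the scene: the camera changes, the substrate does not.
frames = [render(S, camera) for camera in ("door", "hallway", "window")]

# Tampering with a rendered frame does not write back to the substrate.
frames[0]["visible"]["lantern"] = "off"
```

Because `render` only reads, the lantern stays lit in every later frame unless something explicitly commits a change. In this sketch nothing sits between a caller and `commit`; in the paradigm described here, that gate is the role of κ.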

02 · The four channels

S is held as a hierarchical scene graph with attached video memory. Each slot is a small parametric module — geometry, material, animation, dialog state. Channels write deltas; κ resolves them.

  • Ch-Sense · external observation. Camera frames, depth, proprioception. Treated as evidence — not commitment.
  • Ch-Edit · user edit. Explicit, unconditional commitment from a human.
  • Ch-Imagine · the surface's own imagination. Gated by stability and confidence — the model doesn't get to imagine carelessly.
  • Ch-Act · embodied action. The primary channel. Physical interaction is what grounds substrate-hood.
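
One way to picture the four channels is as tagged deltas passed through a resolver. The sketch below is an assumption-heavy illustration: the `Delta` fields, the `resolve` function, the 0.8 threshold, and the three outcome labels are all hypothetical, and the hand-written rules stand in for κ, which the paper describes as a learned operator. Treating Ch-Act as always committing is also an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    SENSE = "Ch-Sense"      # external observation
    EDIT = "Ch-Edit"        # user edit
    IMAGINE = "Ch-Imagine"  # the surface's own imagination
    ACT = "Ch-Act"          # embodied action


@dataclass
class Delta:
    """A proposed write to one substrate slot (hypothetical structure)."""
    channel: Channel
    slot: str
    value: object
    confidence: float = 1.0  # assumed score in [0, 1]
    stability: float = 1.0   # assumed score in [0, 1]


def resolve(delta: Delta, threshold: float = 0.8) -> str:
    """Hand-written stand-in for kappa; the real operator is learned."""
    if delta.channel is Channel.EDIT:
        return "commit"      # explicit, unconditional commitment
    if delta.channel is Channel.SENSE:
        return "evidence"    # kept as evidence, not committed directly
    if delta.channel is Channel.IMAGINE:
        # Gated by stability and confidence (the threshold is an assumption).
        ok = min(delta.confidence, delta.stability) >= threshold
        return "commit" if ok else "reject"
    # Ch-Act: assumed to commit, since the text calls it the primary channel.
    return "commit"


proposals = [
    Delta(Channel.SENSE, "lantern.state", "lit", confidence=0.9),
    Delta(Channel.EDIT, "chair.position", (2.0, 0.0, 1.5)),
    Delta(Channel.IMAGINE, "wall.poster", "map", confidence=0.95, stability=0.4),
    Delta(Channel.IMAGINE, "sky.color", "dusk", confidence=0.9, stability=0.9),
    Delta(Channel.ACT, "door.angle", 35.0),
]
decisions = [resolve(d) for d in proposals]
```

The unstable imagined poster is rejected while the stable imagined sky is committed, which is the asymmetry the channel list describes: imagination may write to the world, but only when it clears a gate.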

κ is small — roughly 10⁷ parameters. Small enough to live in the loop with the renderer, large enough to be learnable. We train it as an inverse-variance commitment policy; in the Gaussian limit it reduces to a classical Kalman update.
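
The link between inverse-variance weighting and the Kalman update can be checked numerically in the scalar case. The sketch below is not the learned κ; it only shows the classical limit the text refers to, with hypothetical function names and made-up numbers. It assumes a single scalar slot, Gaussian uncertainty, and an observation that measures the slot directly.

```python
import math


def inverse_variance_fuse(mean_a, var_a, mean_b, var_b):
    """Fuse two Gaussian estimates, weighting each by 1 / variance."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused_var = 1.0 / (w_a + w_b)
    fused_mean = fused_var * (w_a * mean_a + w_b * mean_b)
    return fused_mean, fused_var


def kalman_update(prior_mean, prior_var, obs, obs_var):
    """Scalar Kalman measurement update with an identity observation model."""
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


# Substrate belief about a slot (mean 2.0, variance 4.0) meets a sharper
# channel proposal (mean 3.0, variance 1.0). Values are illustrative only.
fused = inverse_variance_fuse(2.0, 4.0, 3.0, 1.0)
kalman = kalman_update(2.0, 4.0, 3.0, 1.0)

same_mean = math.isclose(fused[0], kalman[0])
same_var = math.isclose(fused[1], kalman[1])
```

Both routes give the same posterior: the lower-variance source pulls the estimate toward itself, and the fused variance is smaller than either input. A learned commitment policy would presumably depart from this when the Gaussian assumption fails.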

03 · Read the paper

The full paper introduces the substrate-surface paradigm, its four axioms, the commitment operator κ, and a concrete architecture (SSP-PMGS) committed to a 33 ms-per-frame budget, or roughly 30 frames per second. We have an open call for collaborators; if you work on simulation, robotics, or interactive video and any of this resonates, reach out.

Paper · Position Paper · MMXXVI

The Form and the Shadow: A Substrate-Surface Paradigm for Visual World Models

Two leading labs have, independently, shipped the same architectural split — a persistent 3D representation alongside an amnesic generator — and bridged it only with a fixed routing rule. We argue the missing primitive is a calibrated fuser, and propose the Substrate-Surface Paradigm: an explicit physical substrate of what subsists, a generative surface of what appears, and gated channels that commit one to the other through a learned operator κ.

04 · Surfaces it powers

One substrate, several surfaces. The same model that lets a robot plan a manipulation also lets a viewer step into a vertical short on their phone:

  • Embodied AI · κ runs inside the robot's planning loop. We re-render the perceived scene and step it forward in time, using the result as a cost term for the planner.
  • Worlds · S exposes a 3D + dialog surface. One sentence in, a walkable scene out.
  • Palette · the agent that produces interactive stories uses κ as its production canvas — script edits, casting, and storyboard re-cuts all commit through the same operator.
  • Interactive Worlds · the consumer surface where the audience finally meets the substrate. Watch, talk, step in.

05 · What's next

The current v0.4 substrate is single-scene, single-session. The next milestones are cross-session memory (so a character you spoke with last week still remembers you), multi-agent commitment (two characters writing to the same substrate through κ without race conditions), and physics fidelity (broader contact and deformation coverage).

Cite this paper · First Intuition Research Team. The Form and the Shadow: A Substrate-Surface Paradigm for Visual World Models. v0.4.