Learning Primitive Embodied World Models

Towards Scalable Robotic Learning

Keywords: Video Diffusion · Action Primitives · VLM + World Model
“We can only see a short distance ahead, but we can see plenty there that needs to be done.”

— Alan Turing

Overview

🧠 Cerebro-Cerebellar Hierarchy

We decouple high-level planning (VLM as "cerebrum") from low-level dynamics modeling (PEWM as "cerebellum"), enabling modular, scalable, and real-time embodied intelligence.

⚡ Real-Time Generation

Our causal distillation achieves 12 FPS video generation on a single GPU, making it suitable for real robot control loops.

🧩 Compositional Generalization

Zero-shot success on unseen (predicate, object) combinations via primitive recombination, enabling flexible task execution.

Data Philosophy

Our approach is rooted in a fundamental observation about embodied intelligence: data determines generalization. In high-dimensional, continuous robotic control spaces, the scarcity of real-world interaction data severely limits model expressivity and transferability. Unlike language or vision models trained on internet-scale data, embodied agents face a "data desert"—sparse, expensive to collect, and often non-i.i.d.

This leads to a critical bottleneck: without sufficient coverage of state-action trajectories, even powerful diffusion models fail to generalize beyond training distributions. We argue that compositionality through primitives is the key to breaking this barrier: enabling rich behavior from limited data by recombining learned skills.

Figure: Data efficiency across embodied AI methods. PrimitiveWorld achieves higher data efficiency by decomposing complex tasks into reusable action primitives, reducing the need for exhaustive task-specific data collection.

By modeling short-horizon dynamics within primitive-conditioned world models, we decouple skill learning from planning. This enables zero-shot composition: executing unseen task combinations by reusing learned motion priors. As shown below, this architectural choice unlocks strong generalization even in low-data regimes.

Figure: Generalization in diffusion-based world models. PrimitiveWorld outperforms end-to-end approaches on unseen (predicate, object) combinations, demonstrating compositional reasoning enabled by structured world modeling.
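To make the compositional argument concrete, here is a toy sketch (the primitive names and the executability check are illustrative assumptions, not our planner's interface): a task is executable when its predicate has a learned motion prior and its object can be grounded, even if that exact (predicate, object) pair was never demonstrated.

```python
# Hypothetical sketch: primitives are learned per predicate, objects are grounded
# at plan time, so unseen (predicate, object) pairs need no new demonstrations.

SEEN_DEMOS = {("pick", "cube"), ("pick", "mug"), ("open", "drawer"), ("push", "box")}

learned_predicates = {p for p, _ in SEEN_DEMOS}   # skills with a learned motion prior
known_objects      = {o for _, o in SEEN_DEMOS}   # objects the perception stack can ground

def can_execute(predicate: str, obj: str) -> bool:
    """A task is executable if its predicate has a learned prior and the object
    can be grounded, even if this exact pair was never demonstrated jointly."""
    return predicate in learned_predicates and obj in known_objects

# ("open", "box") was never demonstrated jointly, but both factors were seen.
print(can_execute("open", "box"))   # True  -> zero-shot recombination
print(can_execute("pour", "mug"))   # False -> predicate never learned
```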

“Structure your model where data is scarce, and let it learn where data is abundant.”

Methodology

Cerebro-Cerebellar Architecture

  • 🧠 VLM Planner ("Cerebrum"): Decomposes high-level tasks into primitive sequences using natural language reasoning.
  • 🦴 PEWM ("Cerebellum"): Generates realistic short-horizon video rollouts conditioned on start-goal heatmaps.
  • 🎯 SGG Guidance: A start-goal heatmap provides spatial priors for precise control.
  • 🔁 Closed-Loop Execution: The VLM iteratively queries the PEWM to execute the next primitive.
Figure: Schematic of the hierarchical cerebro-cerebellar control system.
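As a concrete illustration of the SGG conditioning, the following is a minimal sketch assuming the start-goal guidance is realized as a two-channel image of Gaussian heatmaps centered on the start and goal pixels; the channel layout, resolution, and sigma are assumptions, not our exact conditioning format.

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center_xy: tuple, sigma: float = 8.0) -> np.ndarray:
    """Single-channel Gaussian bump centered at (x, y), values in [0, 1]."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def start_goal_condition(h, w, start_xy, goal_xy):
    """Stack start and goal heatmaps as a 2-channel spatial prior that can be
    concatenated with the current observation before it enters the world model."""
    return np.stack([gaussian_heatmap(h, w, start_xy),
                     gaussian_heatmap(h, w, goal_xy)], axis=0)  # (2, H, W)

cond = start_goal_condition(480, 640, start_xy=(120, 300), goal_xy=(420, 260))
print(cond.shape)  # (2, 480, 640)
```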

System Architecture


Execution Workflow:

  1. The VLM planner decomposes "Pick up yellow tape measure" into the sequence: approach object → close gripper → lift
  2. For continuous 6-DoF motions, the current observation is sent to the PEWM for video generation
  3. Generated video frames are processed by Gen6D to extract the end-effector trajectory
  4. Discrete gripper actions (open/close) are executed symbolically, without video generation
  5. The robot executes the trajectory, and the system iterates to the next primitive (a minimal sketch of this loop follows below)
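The following is a hedged Python-style sketch of this closed loop; the component interfaces (`vlm.plan`, `pewm.rollout`, `extract_trajectory`, `robot.*`) are hypothetical names standing in for the planner, world model, pose estimator, and controller, not a released API.

```python
# Hypothetical closed-loop sketch of the execution workflow above.
# All component interfaces (vlm, pewm, extract_trajectory, robot) are assumed names.

GRIPPER_PRIMITIVES = {"open gripper", "close gripper"}

def run_task(instruction: str, vlm, pewm, robot, extract_trajectory):
    # 1. VLM planner decomposes the instruction into a primitive sequence.
    primitives = vlm.plan(instruction)  # e.g. ["approach object", "close gripper", "lift"]

    for primitive in primitives:
        if primitive in GRIPPER_PRIMITIVES:
            # 4. Discrete gripper actions are executed symbolically, no video needed.
            robot.set_gripper(open=(primitive == "open gripper"))
            continue

        # 2. Continuous 6-DoF motion: condition the world model on the current view.
        observation = robot.get_image()
        video = pewm.rollout(observation, primitive)   # short-horizon generated frames

        # 3. Recover the end-effector trajectory from the generated frames (e.g. Gen6D).
        trajectory = extract_trajectory(video)         # sequence of 6-DoF poses

        # 5. Execute and move on to the next primitive in the plan.
        robot.follow_trajectory(trajectory)
```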

Quantitative Results

Table: Success Rates on RLBench Tasks (%)

| Method | Close Box | Open Drawer |
|---|---|---|
| Image-BC [1] | 62 | 58 |
| UniPi [2] | 71 | 69 |
| 4DWM [3] | 76 | 73 |
| PrimitiveWorld (Ours) | 89 | 85 |

[1] Jang et al., 2022 | [2] Du et al., 2023 | [3] Zhen et al., 2025

Table: Video Generation Efficiency (measured on A100 GPUs)

| Model | Resolution | VRAM (A100) | FPS |
|---|---|---|---|
| Hunyuan I2V | 480p | 60–79 GB | 0.027 |
| Pika | 576×1024 | 50–70 GB | 0.15 |
| SVD | 576×1024 | 40–60 GB | 2.3 |
| PrimitiveWorld (Ours) | 480p | 22 GB | 12 |
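For a rough sense of what 12 FPS buys in a control loop, a back-of-the-envelope check (the 16-frame rollout length is an assumed example, not a reported number):

```python
# Rough latency budget, assuming a 16-frame primitive rollout (clip length is an assumption).
fps = 12
frames_per_rollout = 16
print(frames_per_rollout / fps)  # ~1.33 s per primitive rollout, vs. minutes at <1 FPS
```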

Table: 6-DoF Pose Estimation Accuracy (Success Rate)

| Method | Success Rate ↑ |
|---|---|
| Gen6D (baseline) | 5/10 |
| + Motion Masking | 8/10 |
| + Motion Masking + Outlier Removal | 9/10 |
| Full (All + Temporal Filtering) | 10/10 |

Qualitative Results

Figure: Generated video rollouts for primitive actions. Simulation panels: Pick Object, Open Drawer, Open Pot Lid; real-world panels: Pick Tape, Open Box, Pull.

Figure: More qualitative results from our Primitive Embodied World Model.

Ablation on Sim-Real Hybrid Strategy

The first row shows results with Sim-Real Hybrid, which produces more coherent and realistic frames than the second row (without Sim-Real Hybrid).

Figure: With Sim-Real Hybrid (first row) vs. without Sim-Real Hybrid (second row).
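The hybrid strategy is only named here, so the snippet below shows one plausible reading: each training batch mixes simulated and real clips at a fixed ratio. The 50/50 split and the list-based dataset interface are assumptions for illustration.

```python
import random

def hybrid_batch(sim_clips, real_clips, batch_size=16, real_fraction=0.5):
    """Draw a training batch that mixes simulated and real video clips.
    The real_fraction value is an assumed hyperparameter, not a reported one."""
    n_real = int(batch_size * real_fraction)
    batch = random.sample(real_clips, n_real) + random.sample(sim_clips, batch_size - n_real)
    random.shuffle(batch)
    return batch
```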

Direct 6-DoF Trajectory Extraction

We apply Gen6D, an off-the-shelf zero-shot 6-DoF pose estimator, to generated video frames to recover the full 6-DoF motion trajectory of the robot end-effector.

Figure: 6-DoF trajectory extraction from generated video frames.
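A minimal sketch of the outlier-removal and temporal-filtering steps from the pose-estimation ablation above, covering translation only for brevity; `estimate_pose` stands in for any per-frame estimator such as Gen6D, and the jump threshold and smoothing factor are illustrative values, not reported hyperparameters.

```python
import numpy as np

def extract_trajectory(frames, estimate_pose, max_jump=0.05, smooth=0.3):
    """Per-frame pose estimates -> cleaned end-effector trajectory (translation only).

    `estimate_pose(frame)` should return a 3-vector translation; full 6-DoF would
    additionally smooth rotations (e.g. with quaternions), omitted here for brevity.
    """
    raw = [np.asarray(estimate_pose(f), dtype=float) for f in frames]

    # Outlier removal: reject estimates that jump implausibly far between frames.
    cleaned, prev = [], None
    for t in raw:
        if prev is not None and np.linalg.norm(t - prev) > max_jump:
            t = prev                      # fall back to the last trusted estimate
        cleaned.append(t)
        prev = t

    # Temporal filtering: exponential smoothing over the retained estimates.
    traj, state = [], cleaned[0]
    for t in cleaned:
        state = smooth * t + (1.0 - smooth) * state
        traj.append(state)
    return np.stack(traj)                 # (T, 3) smoothed translation trajectory
```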

Data Collection & Annotation

Multi-View Data Collection

  • 📹 Five synchronized views from RealSense cameras capture full robot embodiment.
  • 🧩 78 high-quality primitives including pick, place, open, close, push, pull, etc.
  • 5× annotation efficiency: One instruction labels all five views simultaneously.
  • 🌐 Hybrid sim-real data: Real-world data augmented with simulation for scalability.
  • 🧠 Manual collection: All data collected by co-authors for maximum quality.
Figure: Our annotation interface displays five synchronized views with single-text labeling.
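One way to realize the single-text labeling (the record fields below are hypothetical, not our released dataset schema) is to store the instruction once per episode and expand it to every synchronized view at load time.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One demonstration: five synchronized camera streams, one shared label.
    Field names are illustrative, not our released dataset schema."""
    instruction: str          # e.g. "pick up the yellow tape measure"
    primitive: str            # e.g. "pick"
    views: dict               # camera_name -> list of frame paths

def labeled_samples(episode: Episode):
    """Expand one annotation into a (frames, instruction) pair per view,
    which is where the 5x annotation efficiency comes from."""
    return [(frames, episode.instruction) for frames in episode.views.values()]
```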

Watch the Demo

See how PrimitiveWorld enables real-time, closed-loop robotic control with compositional generalization.

▶️ Watch on YouTube