Learning Primitive Embodied World Models

Towards Scalable Robotic Learning

Keywords: Video Diffusion · Action Primitives · VLM + World Model
“We can only see a short distance ahead, but we can see plenty there that needs to be done.”

— Alan Turing

Overview

🧠 Cerebro-Cerebellar Hierarchy

We decouple high-level planning (VLM as "cerebrum") from low-level dynamics modeling (PEWM as "cerebellum"), enabling modular, scalable, and real-time embodied intelligence.

⚡ Real-Time Generation

Our causal distillation achieves 12 FPS video generation on a single GPU, making it suitable for real robot control loops.

🧩 Compositional Generalization

Zero-shot success on unseen (predicate, object) combinations via primitive recombination, enabling flexible task execution.

Data Philosophy

Our approach is rooted in a fundamental observation about embodied intelligence: data determines generalization. In high-dimensional, continuous robotic control spaces, the scarcity of real-world interaction data severely limits model expressivity and transferability. Unlike language or vision models trained on internet-scale data, embodied agents face a "data desert"—sparse, expensive to collect, and often non-i.i.d.

This leads to a critical bottleneck: without sufficient coverage of state-action trajectories, even powerful diffusion models fail to generalize beyond training distributions. We argue that compositionality through primitives is the key to breaking this barrier: enabling rich behavior from limited data by recombining learned skills.

Figure: Data efficiency across embodied AI methods. PrimitiveWorld achieves higher data efficiency by decomposing complex tasks into reusable action primitives, reducing the need for exhaustive task-specific data collection.

By modeling short-horizon dynamics within primitive-conditioned world models, we decouple skill learning from planning. This enables zero-shot composition: executing unseen task combinations by reusing learned motion priors. As shown below, this architectural choice unlocks strong generalization even in low-data regimes.

Figure: Generalization in diffusion-based world models. PrimitiveWorld outperforms end-to-end approaches on unseen (predicate, object) combinations, demonstrating compositional reasoning enabled by structured world modeling.
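To make the compositional argument concrete, here is a toy sketch (the primitive names and the executability check are illustrative assumptions, not our planner's interface): a task is executable when its predicate has a learned motion prior and its object can be grounded, even if that exact (predicate, object) pair was never demonstrated.

```python
# Hypothetical sketch: primitives are learned per predicate, objects are grounded
# at plan time, so unseen (predicate, object) pairs need no new demonstrations.

SEEN_DEMOS = {("pick", "cube"), ("pick", "mug"), ("open", "drawer"), ("push", "box")}

learned_predicates = {p for p, _ in SEEN_DEMOS}   # skills with a learned motion prior
known_objects      = {o for _, o in SEEN_DEMOS}   # objects the perception stack can ground

def can_execute(predicate: str, obj: str) -> bool:
    """A task is executable if its predicate has a learned prior and the object
    can be grounded, even if this exact pair was never demonstrated jointly."""
    return predicate in learned_predicates and obj in known_objects

# ("open", "box") was never demonstrated jointly, but both factors were seen.
print(can_execute("open", "box"))   # True  -> zero-shot recombination
print(can_execute("pour", "mug"))   # False -> predicate never learned
```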

“Structure your model where data is scarce, and let it learn where data is abundant.”

Methodology

Cerebro-Cerebellar Architecture

  • 🧠 VLM Planner ("Cerebrum"): Decomposes high-level tasks into primitive sequences using natural language reasoning.
  • 🦴 PEWM ("Cerebellum"): Generates realistic short-horizon video rollouts conditioned on start-goal heatmaps.
  • 🎯 SGG Guidance: A start-goal heatmap provides spatial priors for precise control.
  • 🔁 Closed-Loop Execution: The VLM iteratively queries the PEWM to execute the next primitive.
Figure: Schematic of the hierarchical cerebro-cerebellar control system.
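As a concrete illustration of the SGG conditioning, the following is a minimal sketch assuming the start-goal guidance is realized as a two-channel image of Gaussian heatmaps centered on the start and goal pixels; the channel layout, resolution, and sigma are assumptions, not our exact conditioning format.

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, center_xy: tuple, sigma: float = 8.0) -> np.ndarray:
    """Single-channel Gaussian bump centered at (x, y), values in [0, 1]."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def start_goal_condition(h, w, start_xy, goal_xy):
    """Stack start and goal heatmaps as a 2-channel spatial prior that can be
    concatenated with the current observation before it enters the world model."""
    return np.stack([gaussian_heatmap(h, w, start_xy),
                     gaussian_heatmap(h, w, goal_xy)], axis=0)  # (2, H, W)

cond = start_goal_condition(480, 640, start_xy=(120, 300), goal_xy=(420, 260))
print(cond.shape)  # (2, 480, 640)
```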

System Architecture


Execution Workflow:

  1. The VLM planner decomposes "Pick up yellow tape measure" into the sequence: approach object → close gripper → lift
  2. For continuous 6-DoF motions, the current observation is sent to the PEWM for video generation
  3. Generated video frames are processed by Gen6D to extract the end-effector trajectory
  4. Discrete gripper actions (open/close) are executed symbolically, without video generation
  5. The robot executes the trajectory, and the system iterates to the next primitive (a minimal sketch of this loop follows below)
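The following is a hedged Python-style sketch of this closed loop; the component interfaces (`vlm.plan`, `pewm.rollout`, `extract_trajectory`, `robot.*`) are hypothetical names standing in for the planner, world model, pose estimator, and controller, not a released API.

```python
# Hypothetical closed-loop sketch of the execution workflow above.
# All component interfaces (vlm, pewm, extract_trajectory, robot) are assumed names.

GRIPPER_PRIMITIVES = {"open gripper", "close gripper"}

def run_task(instruction: str, vlm, pewm, robot, extract_trajectory):
    # 1. VLM planner decomposes the instruction into a primitive sequence.
    primitives = vlm.plan(instruction)  # e.g. ["approach object", "close gripper", "lift"]

    for primitive in primitives:
        if primitive in GRIPPER_PRIMITIVES:
            # 4. Discrete gripper actions are executed symbolically, no video needed.
            robot.set_gripper(open=(primitive == "open gripper"))
            continue

        # 2. Continuous 6-DoF motion: condition the world model on the current view.
        observation = robot.get_image()
        video = pewm.rollout(observation, primitive)   # short-horizon generated frames

        # 3. Recover the end-effector trajectory from the generated frames (e.g. Gen6D).
        trajectory = extract_trajectory(video)         # sequence of 6-DoF poses

        # 5. Execute and move on to the next primitive in the plan.
        robot.follow_trajectory(trajectory)
```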

Quantitative Results

Table: Success Rates on RLBench Tasks (%)

| Method | Close Box | Open Drawer |
|---|---|---|
| Image-BC [1] | 62 | 58 |
| UniPi [2] | 71 | 69 |
| 4DWM [3] | 76 | 73 |
| PrimitiveWorld (Ours) | 89 | 85 |

[1] Jang et al., 2022 | [2] Du et al., 2023 | [3] Zhen et al., 2025

Table: Video Generation Efficiency (measured on A100 GPUs)

| Model | Resolution | VRAM (A100) | FPS |
|---|---|---|---|
| Hunyuan I2V | 480p | 60–79 GB | 0.027 |
| Pika | 576×1024 | 50–70 GB | 0.15 |
| SVD | 576×1024 | 40–60 GB | 2.3 |
| PrimitiveWorld (Ours) | 480p | 22 GB | 12 |
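For a rough sense of what 12 FPS buys in a control loop, a back-of-the-envelope check (the 16-frame rollout length is an assumed example, not a reported number):

```python
# Rough latency budget, assuming a 16-frame primitive rollout (clip length is an assumption).
fps = 12
frames_per_rollout = 16
print(frames_per_rollout / fps)  # ~1.33 s per primitive rollout, vs. minutes at <1 FPS
```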

Table: 6-DoF Pose Estimation Accuracy (Success Rate)

| Method | Success Rate ↑ |
|---|---|
| Gen6D (baseline) | 5/10 |
| + Motion Masking | 8/10 |
| + Motion Masking + Outlier Removal | 9/10 |
| Full (All + Temporal Filtering) | 10/10 |

Qualitative Results

Figure: Generated video rollouts for primitive actions. Simulation panels: Pick Object, Open Drawer, Open Pot Lid; real-world panels: Pick Tape, Open Box, Pull.

Figure: More qualitative results from our Primitive Embodied World Model.

Ablation on Sim-Real Hybrid Strategy

The first row shows results with Sim-Real Hybrid, which produces more coherent and realistic frames than the second row (without Sim-Real Hybrid).

Figure: With Sim-Real Hybrid (first row) vs. without Sim-Real Hybrid (second row).
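The hybrid strategy is only named here, so the snippet below shows one plausible reading: each training batch mixes simulated and real clips at a fixed ratio. The 50/50 split and the list-based dataset interface are assumptions for illustration.

```python
import random

def hybrid_batch(sim_clips, real_clips, batch_size=16, real_fraction=0.5):
    """Draw a training batch that mixes simulated and real video clips.
    The real_fraction value is an assumed hyperparameter, not a reported one."""
    n_real = int(batch_size * real_fraction)
    batch = random.sample(real_clips, n_real) + random.sample(sim_clips, batch_size - n_real)
    random.shuffle(batch)
    return batch
```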

Direct 6-DoF Trajectory Extraction

We apply Gen6D, an off-the-shelf zero-shot 6-DoF pose estimator, to generated video frames to recover the full 6-DoF motion trajectory of the robot end-effector.

Figure: 6-DoF trajectory extraction from generated video frames.
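A minimal sketch of the outlier-removal and temporal-filtering steps from the pose-estimation ablation above, covering translation only for brevity; `estimate_pose` stands in for any per-frame estimator such as Gen6D, and the jump threshold and smoothing factor are illustrative values, not reported hyperparameters.

```python
import numpy as np

def extract_trajectory(frames, estimate_pose, max_jump=0.05, smooth=0.3):
    """Per-frame pose estimates -> cleaned end-effector trajectory (translation only).

    `estimate_pose(frame)` should return a 3-vector translation; full 6-DoF would
    additionally smooth rotations (e.g. with quaternions), omitted here for brevity.
    """
    raw = [np.asarray(estimate_pose(f), dtype=float) for f in frames]

    # Outlier removal: reject estimates that jump implausibly far between frames.
    cleaned, prev = [], None
    for t in raw:
        if prev is not None and np.linalg.norm(t - prev) > max_jump:
            t = prev                      # fall back to the last trusted estimate
        cleaned.append(t)
        prev = t

    # Temporal filtering: exponential smoothing over the retained estimates.
    traj, state = [], cleaned[0]
    for t in cleaned:
        state = smooth * t + (1.0 - smooth) * state
        traj.append(state)
    return np.stack(traj)                 # (T, 3) smoothed translation trajectory
```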

Data Collection & Annotation

Multi-View Data Collection

  • 📹 Five synchronized views from RealSense cameras capture full robot embodiment.
  • 🧩 78 high-quality primitives including pick, place, open, close, push, pull, etc.
  • 5× annotation efficiency: One instruction labels all five views simultaneously.
  • 🌐 Hybrid sim-real data: Real-world data augmented with simulation for scalability.
  • 🧠 Manual collection: All data collected by co-authors for maximum quality.
Figure: Our annotation interface displays five synchronized views with single-text labeling.
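One way to realize the single-text labeling (the record fields below are hypothetical, not our released dataset schema) is to store the instruction once per episode and expand it to every synchronized view at load time.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One demonstration: five synchronized camera streams, one shared label.
    Field names are illustrative, not our released dataset schema."""
    instruction: str          # e.g. "pick up the yellow tape measure"
    primitive: str            # e.g. "pick"
    views: dict               # camera_name -> list of frame paths

def labeled_samples(episode: Episode):
    """Expand one annotation into a (frames, instruction) pair per view,
    which is where the 5x annotation efficiency comes from."""
    return [(frames, episode.instruction) for frames in episode.views.values()]
```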

Watch the Demo

See how PrimitiveWorld enables real-time, closed-loop robotic control with compositional generalization.

▶️ Watch on YouTube