“We can only see a short distance ahead, but we can see plenty there that needs to be done.”
— Alan Turing
We decouple high-level planning (VLM as "cerebrum") from low-level dynamics modeling (PEWM as "cerebellum"), enabling modular, scalable, and real-time embodied intelligence.
Our causal distillation achieves 12 FPS video generation on a single GPU, making it suitable for real robot control loops.
Zero-shot success on unseen (predicate, object) combinations via primitive recombination, enabling flexible task execution.
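To make the recombination idea concrete, the minimal sketch below shows how a library of predicate-level motion priors can cover an unseen (predicate, object) pair at test time. Everything here (the PrimitiveLibrary class, the training pairs, the string placeholders) is a hypothetical illustration, not our released code.

```python
# Minimal sketch (illustrative only): unseen (predicate, object) pairs are
# covered by reusing a predicate's learned motion prior with a fresh object
# grounding. All names below are hypothetical placeholders.

TRAIN_PAIRS = {("pick", "cube"), ("open", "drawer"), ("pick", "tape")}

class PrimitiveLibrary:
    """Maps each predicate to a learned, object-agnostic motion prior."""

    def __init__(self, predicates):
        self.priors = {p: f"<motion prior for '{p}'>" for p in predicates}

    def compose(self, predicate, obj):
        # The predicate's prior is reused as-is; only the object grounding
        # (e.g. a start-goal heatmap) changes at test time.
        return self.priors[predicate], f"<start-goal heatmap for '{obj}'>"

library = PrimitiveLibrary({p for p, _ in TRAIN_PAIRS})

# ("open", "pot lid") never appears as a training pair, but both pieces do:
# the 'open' prior (from drawers) and a new grounding of 'pot lid'.
unseen = ("open", "pot lid")
assert unseen not in TRAIN_PAIRS
prior, grounding = library.compose(*unseen)
print(prior, grounding)
```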
Our approach is rooted in a fundamental observation about embodied intelligence: data determines generalization. In high-dimensional, continuous robotic control spaces, the scarcity of real-world interaction data severely limits model expressivity and transferability. Unlike language or vision models trained on internet-scale data, embodied agents face a "data desert"—sparse, expensive to collect, and often non-i.i.d.
This leads to a critical bottleneck: without sufficient coverage of state-action trajectories, even powerful diffusion models fail to generalize beyond their training distributions. We argue that compositionality through primitives is the key to breaking this barrier: it enables rich behavior from limited data by recombining learned skills.
Figure: PrimitiveWorld achieves higher data efficiency by decomposing complex tasks into reusable action primitives, reducing the need for exhaustive task-specific data collection.
By modeling short-horizon dynamics within primitive-conditioned world models, we decouple skill learning from planning. This enables zero-shot composition: unseen task combinations are executed by reusing learned motion priors. As shown below, this architectural choice unlocks strong generalization even in low-data regimes.
Figure: Generalization in Diffusion Models. PrimitiveWorld outperforms end-to-end approaches on unseen (predicate, object) combinations, demonstrating compositional reasoning enabled by structured world modeling.
“Structure your model where data is scarce, and let it learn where data is abundant.”
Schematic of the hierarchical control system
VLM planner: decomposes high-level tasks into primitive sequences using natural language reasoning.
PEWM: generates realistic short-horizon video rollouts conditioned on start-goal heatmaps.
Start-goal heatmap: provides spatial priors for precise control and closed-loop execution.
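The minimal sketch below shows how these three components could interact in one closed-loop step under our decomposition; the function names (plan_primitives, make_start_goal_heatmap, pewm_rollout, recover_ee_poses, execute) are illustrative stubs, not the project's actual API.

```python
# Hedged sketch of the hierarchical control loop (stubs, not the real API).
import numpy as np

def plan_primitives(instruction: str):
    """'Cerebrum': a VLM decomposes the task into (predicate, object) steps."""
    return [("pick", "tape"), ("open", "drawer")]            # stub plan

def make_start_goal_heatmap(obj: str, shape=(64, 64)):
    """Spatial prior: Gaussian bumps at the (stubbed) start and goal pixels."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    def bump(cy, cx):
        return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / 50.0)
    return bump(16, 16) + bump(48, 48)

def pewm_rollout(predicate, heatmap, horizon=8):
    """'Cerebellum': PEWM generates a short-horizon video conditioned on the
    primitive and its start-goal heatmap (stubbed here as random frames)."""
    return np.random.rand(horizon, *heatmap.shape, 3)

def recover_ee_poses(video):
    """Pose estimation on the generated frames yields 6-DoF waypoints."""
    return np.zeros((len(video), 6))                          # stub trajectory

def execute(poses):
    print(f"executing {len(poses)} waypoints")

for predicate, obj in plan_primitives("put the tape in the drawer"):
    heatmap = make_start_goal_heatmap(obj)
    video = pewm_rollout(predicate, heatmap)
    execute(recover_ee_poses(video))                          # one closed-loop step
```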
| Method | Close Box (%) | Open Drawer (%) |
|---|---|---|
| Image-BC [1] | 62 | 58 | 
| UniPi [2] | 71 | 69 | 
| 4DWM [3] | 76 | 73 | 
| PrimitiveWorld (Ours) | 89 | 85 | 
[1] Jang et al., 2022 | [2] Du et al., 2023 | [3] Zhen et al., 2025
Video generation performance on a single A100 GPU
| Model | Resolution | VRAM (A100) | FPS | 
|---|---|---|---|
| Hunyuan I2V | 480p | 60–79 GB | 0.027 | 
| Pika | 576x1024 | 50–70 GB | 0.15 | 
| SVD | 576x1024 | 40–60 GB | 2.3 | 
| PrimitiveWorld (Ours) | 480p | 22 GB | 12 | 
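The throughput gap above comes from generating frames causally instead of denoising an entire clip jointly: with a few-step distilled denoiser, each frame can be handed to the controller as soon as it is produced. The sketch below illustrates that general pattern only; the step count, context window, and the distilled_denoise_step stub are assumptions for illustration, not our exact architecture.

```python
# Illustrative frame-causal rollout with a few-step distilled denoiser.
# Shapes, step counts, and the denoiser stub are assumptions, not the model.
import numpy as np

NUM_DISTILLED_STEPS = 4          # few-step student vs. a many-step teacher

def distilled_denoise_step(frame, context, step):
    """Stub for one denoising step conditioned only on past frames."""
    return 0.5 * frame + 0.5 * context.mean(axis=0)

def causal_rollout(first_frame, horizon=8):
    frames = [first_frame]
    for _ in range(horizon):
        x = np.random.randn(*first_frame.shape)      # start each frame from noise
        context = np.stack(frames[-4:])              # causal context: past frames only
        for step in range(NUM_DISTILLED_STEPS):
            x = distilled_denoise_step(x, context, step)
        frames.append(x)
        yield x                                      # stream the frame immediately

for frame in causal_rollout(np.zeros((64, 64, 3))):
    pass  # each frame is available to the control loop as soon as it is generated
```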
| Method | Success Rate ↑ | 
|---|---|
| Gen6D (baseline) | 5/10 | 
| + Motion Masking | 8/10 | 
| + Motion Masking + Outlier Removal | 9/10 | 
| Full (All + Temporal Filtering) | 10/10 | 
Generated video rollouts for primitive actions:
- Simulation: Pick Object
- Simulation: Open Drawer
- Simulation: Open the Pot Lid
- Real World: Pick Tape
- Simulation: Open Box
- Real World: Pick Tape
More qualitative results from our Primitive-Enabled World Model
The first row shows results with Sim-Real Hybrid, which produces more coherent and realistic frames than the second row (without Sim-Real Hybrid).
 
With Sim-Real Hybrid
Without Sim-Real Hybrid
We apply Gen6D, an off-the-shelf zero-shot 6-DoF pose estimator, to generated video frames to recover the full 6-DoF motion trajectory of the robot end-effector.
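A hedged sketch of how the pose-recovery steps ablated above (motion masking, outlier removal, temporal filtering) could be assembled around such an estimator is given below; the thresholds, the exponential filter, and the estimate_pose stub are illustrative assumptions, not our exact pipeline.

```python
# Illustrative post-processing around a per-frame pose estimator (stubbed).
import numpy as np

def estimate_pose(frame, mask=None):
    """Stand-in for Gen6D: one 6-DoF pose (xyz + rpy) per frame."""
    return np.random.randn(6)

def motion_mask(prev_frame, frame, thresh=0.05):
    """Keep only pixels that moved, so the static background does not
    dominate the pose estimate."""
    return np.abs(frame - prev_frame).mean(axis=-1) > thresh

def remove_outliers(poses, max_jump=0.1):
    """Drop poses whose translation jumps implausibly between frames."""
    kept = [poses[0]]
    for p in poses[1:]:
        kept.append(kept[-1] if np.linalg.norm(p[:3] - kept[-1][:3]) > max_jump else p)
    return np.stack(kept)

def temporal_filter(poses, alpha=0.6):
    """Exponential smoothing over time to suppress frame-to-frame jitter."""
    out = [poses[0]]
    for p in poses[1:]:
        out.append(alpha * out[-1] + (1 - alpha) * p)
    return np.stack(out)

video = np.random.rand(8, 64, 64, 3)                  # a generated rollout (stub)
poses = [estimate_pose(video[0])]
for prev, frame in zip(video[:-1], video[1:]):
    poses.append(estimate_pose(frame, mask=motion_mask(prev, frame)))
trajectory = temporal_filter(remove_outliers(np.stack(poses)))
```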
 
           
Our annotation interface displays five synchronized views with single-text labeling.
See how PrimitiveWorld enables real-time, closed-loop robotic control with compositional generalization.