What is InSpatio-World: Exploring Real-Time 4D Simulation with Spatiotemporal Models

InSpatio-World is a real-time 4D world simulator that turns a single video into a controllable scene. You can move the camera through space and pause or resume time while the world stays stable and follows physics.

It builds a stable world state from your input video. Then it lets you sample new views and moments so results feel consistent from frame to frame.

InSpatio-World Overview

Here is a quick summary of what the project offers and how it works.

Item	Details
Type	Real-time 4D world simulator from video
Purpose	Turn a single video into a stable, controllable world for new views and time control
Main Features	State-anchored world state, spatiotemporal autoregression, joint distribution matching distillation
Input	One or more MP4 video files
Output	Novel-view videos based on a camera path, plus temporal edits like freeze and resume
Model Size	1.3B main checkpoint with support tools for captions and depth
Requirements	Python 3.10, CUDA 12.1, NVIDIA GPU
Key Tools	Florence-2 for captions, DA3 for depth, Wan2.1 backbone, TAE option for speed
How You Control It	Camera path file with pitch, yaw, and displacement over time
Speed Options	Switch to TAE, compile DiT, reach up to 24 fps on an H-series NVIDIA GPU after warm-up
Best For	Content creators, researchers, interactive demos, driving scenes
Project Site	Visit the InSpatio-World page to learn more and see examples

For related work and context from the same research family, see our short overview at Bytedance.

InSpatio-World Key Features

Builds a lasting local world state from a reference video so objects keep their place in 3D space.
Keeps physics in check with gravity, collisions, and inertia so motion feels natural.
Lets you roam freely by changing viewpoint while staying linked to the same world state.
Offers time control such as freeze and hold frames, then resume motion.
Works end-to-end with a simple test script that runs captions, depth, and view synthesis.

InSpatio-World Use Cases

Creative camera moves for product shots, travel clips, and vlogs from a single take.
Freeze time to focus on a key moment, then rotate or dolly for dramatic effects.
Research on world modeling and long-range video consistency.
Driving-scene view changes with rotation-only control.

If you are exploring 4D content systems, you may also like our quick read on the Ex 4D system for contrast and ideas.

Performance & Showcases

Showcase 1: Code, Demo, arXiv. This is the core project reel that highlights the full system in action. It points to the code, the demo, and the arXiv paper so you can trace the steps from method to working results.

Showcase 2: Free Spatial Roaming. Watch the camera travel through space while the scene stays stable. This shows what Free Spatial Roaming looks like in practice.

Showcase 3: Free Spatial Roaming. Another clip of Free Spatial Roaming, showing smooth motion across angles with a good sense of depth. You can see objects hold steady across changes in viewpoint.

Showcase 4: Free Spatial Roaming. A third Free Spatial Roaming demo that stresses long camera paths. World details persist as the viewpoint moves.

Showcase 5: Temporal Control. Here the focus is time editing. Temporal Control lets you freeze motion for a set number of frames, then continue.

Showcase 6: Temporal Control. A second Temporal Control example shows how the choice of freeze frame and pause duration creates different story beats. It is helpful for dramatic pauses or emphasis.

How InSpatio-World Works

The project is built on three simple ideas. The world should respect physics, objects should keep their place in space even off camera, and changes over time should follow cause and effect.

To do this, the system does not just draw pixels. It keeps a local world state that it grows and updates over time.

Figure: State-Anchored World Modeling

This state is anchored to a reference video. New views and time steps are sampled from this same state, so the output stays stable across long sequences.

The Technology Behind It

Spatiotemporal autoregression is the sampling process that picks what to show next in space and time. It is guided by the reference video and the current world state.

Figure: Spatiotemporal Autoregression Pipeline

Joint distribution matching distillation teaches the model to balance real video quality with user control. This helps it generalize under interaction and avoid drift.

For a friendly primer on world-focused systems, take a look at our short note on the Os World project.

Installation & Setup

Follow these steps in order. Copy commands exactly as shown.

Requirements

Python 3.10
CUDA 12.1

Create conda environment:

conda env create -f environment.yml
conda activate inspatio_world
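
After activating the environment, it can help to confirm that the interpreter matches the stated requirement. The following is a minimal sketch using only the standard library; the helper name python_matches is made up for illustration, and the check covers the Python version only, not CUDA.

```python
import sys

REQUIRED_PYTHON = (3, 10)  # taken from the Requirements list in this article


def python_matches(version_info=sys.version_info, required=REQUIRED_PYTHON):
    """Return True when the major.minor version equals the required one."""
    return tuple(version_info[:2]) == required


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_matches() else "mismatch"
    print(f"Python {found}: {status} (expected 3.10)")
```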

Install flash-attn:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Model Weights

Download the following model checkpoints into the checkpoints/ directory:

bash scripts/download.sh

Expected directory structure after downloading:

checkpoints/
├── InSpatio-World-1.3B/
│ └── InSpatio-World-1.3B.safetensors
├── Wan2.1-T2V-1.3B/
├── DA3/
├── Florence-2-large/
└── taehv/
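
One way to verify the download before running inference is to compare the folder against the tree above. This is a small sketch, not part of the project: the function name missing_checkpoints is hypothetical, and the expected entries are copied from the directory listing in this article.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to checkpoints/.
EXPECTED = [
    "InSpatio-World-1.3B/InSpatio-World-1.3B.safetensors",
    "Wan2.1-T2V-1.3B",
    "DA3",
    "Florence-2-large",
    "taehv",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected entries that do not exist under root."""
    base = Path(root)
    return [entry for entry in EXPECTED if not (base / entry).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected checkpoint paths are present.")
```

The check only confirms that the paths exist. It does not validate file contents or sizes.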

Inference

The full pipeline runs in three steps:

Step 1: Generate video captions using Florence-2.
Step 2: Estimate depth with DA3, convert to inference format, and render point clouds.
Step 3: Run InSpatio-World v2v (video-to-video) inference.

All steps are wrapped in a single script:

bash run_test_pipeline.sh \
 --input_dir ./test/example \
 --traj_txt_path ./traj/x_y_circle_cycle.txt
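
If you drive the script from Python, for example to batch several folders, a thin wrapper keeps the arguments in one place. The sketch below only assembles the command; the function name build_pipeline_command is hypothetical, and the flags are limited to ones that appear in this article.

```python
import shlex


def build_pipeline_command(input_dir, traj_txt_path, skip_step1=False,
                           skip_step2=False, use_tae=False, compile_dit=False):
    """Assemble the run_test_pipeline.sh call as an argument list."""
    cmd = ["bash", "run_test_pipeline.sh",
           "--input_dir", str(input_dir),
           "--traj_txt_path", str(traj_txt_path)]
    # Boolean switches are appended only when requested.
    switches = {
        "--skip_step1": skip_step1,
        "--skip_step2": skip_step2,
        "--use_tae": use_tae,
        "--compile_dit": compile_dit,
    }
    cmd += [flag for flag, enabled in switches.items() if enabled]
    return cmd


if __name__ == "__main__":
    cmd = build_pipeline_command("./test/example", "./traj/x_y_circle_cycle.txt")
    print(shlex.join(cmd))
    # To actually run it from the repository root:
    # import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when folder names contain spaces.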

Quick Start

# 1. Place your .mp4 video(s) in a folder
mkdir -p my_videos
cp your_video.mp4 my_videos/

# 2. Run the full pipeline
bash run_test_pipeline.sh \
 --input_dir ./my_videos \
 --traj_txt_path ./traj/x_y_circle_cycle.txt

# 3. Results will be saved to ./output/my_videos/x_y_circle_cycle/
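
Before a long run it can be useful to confirm which videos will be picked up and where results are likely to land. The sketch below is illustrative only: both function names are hypothetical, and the output folder pattern is an assumption inferred from the single example path above, not a documented rule.

```python
from pathlib import Path


def list_input_videos(input_dir):
    """Return the .mp4 files directly inside input_dir, sorted by name."""
    return sorted(p for p in Path(input_dir).iterdir()
                  if p.is_file() and p.suffix.lower() == ".mp4")


def guess_output_dir(input_dir, traj_txt_path, output_root="./output"):
    """Guess the results folder as <output_root>/<input folder>/<trajectory name>.

    Assumption: this pattern is inferred from the one example in the article.
    """
    return Path(output_root) / Path(input_dir).name / Path(traj_txt_path).stem


if __name__ == "__main__":
    target = guess_output_dir("./my_videos", "./traj/x_y_circle_cycle.txt")
    print("Expecting results under:", target.as_posix())
```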

Trajectory Control

The --traj_txt_path argument controls the camera trajectory for novel-view synthesis. Predefined trajectories are provided in the traj/ directory.

Trajectory File Format

A trajectory file is a plain text file with 3 lines, each containing space-separated keyframe values that are automatically interpolated to match the output frame count:

<line 1> pitch (degrees): positive = orbit up, negative = orbit down
<line 2> yaw (degrees): positive = orbit left, negative = orbit right
<line 3> displacement: relative camera displacement scale
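
A custom trajectory file can be produced with a few lines of Python. This is a sketch under the three-line format described above; the helper name write_trajectory, the file name my_orbit.txt, and the keyframe values are all made up for illustration.

```python
from pathlib import Path


def write_trajectory(path, pitch, yaw, displacement):
    """Write three space-separated keyframe lines: pitch, yaw, displacement."""
    rows = [pitch, yaw, displacement]
    if any(len(row) == 0 for row in rows):
        raise ValueError("each line needs at least one keyframe value")
    text = "\n".join(" ".join(f"{v:g}" for v in row) for row in rows)
    Path(path).write_text(text + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical path: hold pitch at 0, sweep yaw left then right,
    # and keep the displacement scale at 1.
    write_trajectory("my_orbit.txt",
                     pitch=[0, 0, 0],
                     yaw=[0, 15, -15],
                     displacement=[1, 1, 1])
```

The resulting file can then be passed to --traj_txt_path in place of a predefined trajectory.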

Line 3 (displacement) is a relative scale multiplied by the scene's estimated foreground depth:

When pitch/yaw are non-zero, it controls the orbit radius (typically set to 1)
When both pitch and yaw are zero, it becomes a dolly zoom: positive = move forward (zoom in), negative = move backward (zoom out)
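
To make the keyframe interpolation and the two displacement modes concrete, here is a small sketch. It assumes linear interpolation between evenly spaced keyframes and classifies the mode over the whole trajectory; the article does not state which interpolation scheme or granularity the project actually uses, and both function names are hypothetical.

```python
def interpolate_keyframes(keyframes, num_frames):
    """Linearly resample keyframes to num_frames evenly spaced values."""
    if not keyframes or num_frames < 1:
        raise ValueError("need at least one keyframe and one frame")
    if len(keyframes) == 1 or num_frames == 1:
        return [float(keyframes[0])] * num_frames
    out = []
    last = len(keyframes) - 1
    for i in range(num_frames):
        pos = i * last / (num_frames - 1)   # position in keyframe units
        lo = min(int(pos), last - 1)        # left keyframe index
        t = pos - lo                        # blend weight in [0, 1]
        out.append((1 - t) * keyframes[lo] + t * keyframes[lo + 1])
    return out


def displacement_mode(pitch, yaw):
    """Classify how line 3 is read, following the two rules above."""
    if all(v == 0 for v in pitch) and all(v == 0 for v in yaw):
        return "dolly"   # line 3 moves the camera forward or backward
    return "orbit"       # line 3 scales the orbit radius


if __name__ == "__main__":
    yaw = interpolate_keyframes([0, 10, 0], 5)
    print(yaw, displacement_mode([0, 0, 0], yaw))
```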

Skip Already-Completed Steps

If Step 1 or Step 2 outputs already exist, you can skip them:

bash run_test_pipeline.sh \
 --input_dir ./my_videos \
 --traj_txt_path ./traj/x_y_circle_cycle.txt \
 --skip_step1 --skip_step2

Generate Temporal Control Videos

bash run_test_pipeline.sh \
 --input_dir ./test/example \
 --traj_txt_path ./traj/x_y_circle_cycle.txt \
 --freeze_repeat 150 \
 --output_folder ./output/example_freeze_repeat_150 \
 --disable_adaptive_frame

Two parameters control the time-stop behavior: --freeze_frame chooses which frame to freeze (default: the middle frame), and --freeze_repeat sets how long the pause lasts, in frames.
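
The effect of the two freeze parameters on the timeline can be pictured as a list of source-frame indices. The sketch below is a conceptual illustration only, not the project's implementation: it assumes the held frames are inserted into the clip (lengthening it) and that the middle frame is num_frames // 2. The function name is hypothetical.

```python
def frozen_frame_indices(num_frames, freeze_repeat, freeze_frame=None):
    """Return source-frame indices for a clip that pauses on one frame.

    freeze_frame defaults to the middle frame (assumed to be num_frames // 2).
    """
    if num_frames < 1 or freeze_repeat < 0:
        raise ValueError("num_frames must be >= 1 and freeze_repeat >= 0")
    if freeze_frame is None:
        freeze_frame = num_frames // 2
    if not 0 <= freeze_frame < num_frames:
        raise ValueError("freeze_frame is outside the clip")
    before = list(range(freeze_frame + 1))             # play up to the pause
    held = [freeze_frame] * freeze_repeat              # hold the paused frame
    after = list(range(freeze_frame + 1, num_frames))  # resume playback
    return before + held + after


if __name__ == "__main__":
    print(frozen_frame_indices(num_frames=5, freeze_repeat=3))
```

Only the time index is held during the pause. The camera trajectory can keep moving, which is what produces the frozen-moment orbit effect.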

Autonomous Driving Applications

bash run_test_pipeline.sh \
 --input_dir ./test/example3 \
 --traj_txt_path ./traj/x_y_circle_cycle.txt \
 --relative_to_source \
 --rotation_only \
 --disable_adaptive_frame

Speed Up

bash run_test_pipeline.sh \
 --input_dir ./test/example \
 --traj_txt_path ./traj/x_y_circle_cycle.txt \
 --use_tae \
 --disable_adaptive_frame

Switching from the VAE to TAE with --use_tae accelerates the process. Adding --compile_dit boosts speed further, reaching 24 fps on an H-series NVIDIA GPU with the 1.3B model. Note that compilation needs a relatively long warm-up the first time it is triggered, so it suits service deployments where sustained speed matters more than startup time.
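
A little arithmetic shows why a one-time warm-up favors long-running services. The sketch below averages the warm-up cost over the number of frames generated; the 24 fps figure comes from this article, while the 120-second warm-up is a hypothetical placeholder, not a measured value.

```python
def effective_fps(num_frames, steady_fps=24.0, warmup_seconds=0.0):
    """Average frames per second once a one-time warm-up is included."""
    if num_frames <= 0 or steady_fps <= 0 or warmup_seconds < 0:
        raise ValueError("invalid inputs")
    total_seconds = warmup_seconds + num_frames / steady_fps
    return num_frames / total_seconds


if __name__ == "__main__":
    warmup = 120.0  # hypothetical warm-up time in seconds, not a measured value
    for frames in (240, 24_000, 2_400_000):
        print(frames, round(effective_fps(frames, 24.0, warmup), 2))
```

For a short clip the warm-up dominates the total time, while over a long session the average approaches the steady-state rate.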

Tips for Best Results

Pick a clear input video with steady motion and good lighting. This helps the world state lock in cleanly.
Try small pitch and yaw changes first, then expand the path once you are happy.
For long runs or service use, enable --compile_dit to reach high frame rates after warm-up.

Evaluation and Speed Notes

The method is built for real time and interaction. It keeps objects steady over long sequences and helps avoid drift when you move the camera.

Figure: Evaluation Chart

This focus on state and causality matches the dynamic-quality checks used in world-score-style tests. It is meant for hands-on control, not just short clips.

FAQ

What input do I need to start?

You only need a standard MP4 video. Place it in a folder and point the script to that folder.

How do I control the camera path?

Use a simple text file with three lines for pitch, yaw, and displacement. The script reads it and makes smooth motion across frames.

Can I freeze time in the middle of the clip?

Yes. Set --freeze_frame to pick the frame and --freeze_repeat to set how long the pause lasts.

What hardware should I use?

Use a CUDA 12.1-compatible NVIDIA GPU for best results. Python 3.10 is required.

How fast can it run?

With --use_tae and --compile_dit enabled, the system can reach about 24 fps on an H-series NVIDIA GPU after an initial warm-up. This works best when you keep the service running.

Does it work for driving scenes?

Yes. There is an example command for rotation-only control that suits driving views. Use the sample command under Autonomous Driving Applications.

Image source: InSpatio-World: Exploring Real-Time 4D Simulation with Spatiotemporal Models