ReVidgen: The Future of Video Generation for the Physical World

What is ReVidgen: The Future of Video Generation for the Physical World
ReVidgen is a research project that makes video AI safer and smarter for robots in the real world. It gives you two things: a clear test (RBench) to score video models on robot tasks, and a huge training dataset (RoVid-X) built from real robot actions.

At a high level, you generate videos with your model, and ReVidgen scores how well those videos match tasks like picking, moving, or walking. If you are new to this field, check our simple intro on how text turns into short clips in this text‑to‑video guide.
ReVidgen: The Future of Video Generation for the Physical World Overview
ReVidgen focuses on the “embodied world,” which means robots that act in space with bodies like arms, legs, or wheels. The team evaluates today’s top video models and shows how well they handle tasks and robot types. It also shares a million-scale dataset to help train better models.
| Item | Details |
|---|---|
| Type | Research project and toolkit for robotics video generation |
| Purpose | Test and train video models that understand physical actions and robot bodies |
| Main parts | RBench (evaluation), RoVid-X (dataset), usage scripts |
| What it measures | Tasks (manipulation, spatial, multi-entity, long-horizon, reasoning) and embodiments (single arm, dual arm, quadruped, humanoid) |
| Who it’s for | AI researchers, robotics labs, product teams testing video models |
| Status | Paper released; dataset and benchmark planned for Hugging Face release |
| Project page | https://dagroup-pku.github.io/ReVidgen.github.io/ |
| Code | https://github.com/DAGroup-PKU/ReVidgen |
| Notable result | Strong match with human ratings (Spearman 0.96) |
| Data highlights | RGB, depth, and optical flow videos for robot interaction |

ReVidgen: The Future of Video Generation for the Physical World Key Features
- Fine-grained scoring with RBench. It checks five task types and four robot body types, so results are easy to compare across real needs.
- Human-like judgment. RBench scores match human ratings closely (Spearman correlation of 0.96), which builds trust in the results.
- Ready-to-run scripts. You place your model's videos in a folder, run two bash scripts, and get reports per task and per embodiment.
- Practical video format rules. A simple folder structure keeps models organized across tasks and robot bodies.
- Training-ready data. RoVid-X includes RGB, depth, and optical flow to help models learn motion and action cues.
- Support for long actions. The benchmark also tracks long-horizon tasks. For more on long-context ideas in video AI, see our short read on long context video models.

How it Works
You generate sample videos with your model for each task and robot type. Then you place them into a standard folder layout and run the provided scripts to score them. The toolkit uses detectors and trackers to measure whether objects, motion, and actions match the prompts.
For example, “manipulation” checks how an arm handles objects, and “spatial” looks at space-aware moves like placing or avoiding. “Embodiments” means the robot body, such as single arm, dual arm, quadruped, or humanoid. This way, you see strengths and gaps in plain terms.
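To make the flow concrete, here is a minimal shell sketch. The model name `my_model`, the source path, and the task folder name are placeholders; the real task and embodiment names must match what RBench expects.

```bash
# Minimal sketch: stage one model's generated clips into the expected layout,
# then run both evaluations. "my_model", the source path, and the task folder
# name below are placeholders, not part of the official toolkit.
MODEL=my_model
TASK=manipulation

mkdir -p "data/${MODEL}/${TASK}/videos"

# Copy clips in, renaming to the zero-padded scheme (0001.mp4, 0002.mp4, ...)
i=1
for f in /path/to/generated_clips/*.mp4; do
  cp "$f" "$(printf "data/${MODEL}/${TASK}/videos/%04d.mp4" "$i")"
  i=$((i + 1))
done

# Score by embodiment and by task (details in Quick Start below)
bash scripts/rbench_eval_4embodiments.sh
bash scripts/rbench_eval_5tasks.sh
```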

The Technology Behind It
Under the hood, ReVidgen ties together strong open modules to read objects, segments, and motion. It includes Grounded-Segment-Anything, Grounded SAM‑2, Q‑Align, BERT, and a tracker (Cotracker). These tools help the benchmark judge actions across frames, not just a single shot.
This mix lets the system check object presence, masks, and movement to build fair scores. Want to compare with a popular production system? Here’s a quick look at ByteDance’s approach in our Goku video model overview.

Installation & Setup
Follow these steps exactly. Do not skip any command.
Environment
```bash
# 0. Clone the repo
git clone https://github.com/DAGroup-PKU/ReVidgen.git
cd ReVidgen

# 1. Environment for RBench
conda create -n rbench python=3.10.18
conda activate rbench
pip install --upgrade setuptools
pip install torch==2.5.1 torchvision==0.20.1

# Install Grounded-Segment-Anything module
cd pkgs/Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .

cd ..
pip install -r requirements.txt
```
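As an optional sanity check (not part of the official steps), you can confirm that the pinned PyTorch build imports cleanly and sees your GPU:

```bash
# Optional: verify the pinned torch install and CUDA visibility
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```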
Download Checkpoints
Please download the checkpoint files from RBench and organize them in the following directory structure before running the evaluation:
```
ReVidgen/
├── checkpoints/
│   ├── BERT
│   │   └── google-bert
│   │       └── bert-base-uncased
│   │           ├── LICENSE
│   │           └── ...
│   ├── GroundingDino
│   │   └── groundingdino_swinb_cogcoor.pth
│   ├── q-future
│   │   └── one-align
│   │       ├── README.md
│   │       └── ...
│   ├── SAM
│   │   └── sam2.1_hiera_large.pt
│   └── Cotracker
│       └── scaled_offline.pth
│
├── eval/
│   ├── 4_embodiments/
│   ├── 5_tasks/
│   └── ...
│
├── pkgs/
│   ├── Grounded-Segment-Anything/
│   └── ...
└── ...
```
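If you prefer to pull the weights from each module's upstream release rather than a bundle, a sketch follows. Every URL and Hugging Face repo ID here is an assumption based on these projects' public releases, so prefer whatever the RBench page links if they differ.

```bash
# Sketch only: fetch checkpoints from upstream releases into the layout above.
# URLs and repo IDs are assumptions from each project's public release; prefer
# the files linked by RBench if they differ.
mkdir -p checkpoints/{BERT,GroundingDino,SAM,Cotracker,q-future}

# GroundingDINO Swin-B grounding checkpoint
wget -P checkpoints/GroundingDino \
  https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth

# SAM 2.1 Hiera-Large checkpoint
wget -P checkpoints/SAM \
  https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt

# CoTracker offline checkpoint (assumed to come from the CoTracker3 release)
wget -P checkpoints/Cotracker \
  https://huggingface.co/facebook/cotracker3/resolve/main/scaled_offline.pth

# BERT and Q-Align (OneAlign) weights from Hugging Face
huggingface-cli download google-bert/bert-base-uncased \
  --local-dir checkpoints/BERT/google-bert/bert-base-uncased
huggingface-cli download q-future/one-align \
  --local-dir checkpoints/q-future/one-align
```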
Download RBench Validation Set
```bash
# If you are in China mainland, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/RBench
```
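To place the files somewhere predictable, `huggingface-cli download` also accepts a target directory. Whether RBench is published as a dataset or a model repo is an assumption here, so drop or adjust `--repo-type` if the command fails:

```bash
# Sketch: pin the validation set to a local folder (--repo-type is an assumption)
huggingface-cli download DAGroup-PKU/RBench --repo-type dataset --local-dir rbench_val
```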
Video Generation Format
Generated videos should be organized according to the directory structure below.
```
ReVidgen/
└── data/
    └── {model_name}/
        └── {task_name/embodiment_name}/
            └── videos/
                ├── 0001.mp4
                ├── 0002.mp4
                ├── 0003.mp4
                └── ...
```
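Before scoring, a plain-shell check like the one below (not part of the toolkit) confirms each model/task folder actually contains numbered clips:

```bash
# Sanity check: count the .mp4 clips in every videos/ folder under data/
for d in data/*/*/videos; do
  echo "$d: $(ls "$d"/*.mp4 2>/dev/null | wc -l) clips"
done
```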
Quick Start
> **Note:** To enable GPT-based evaluation, please prepare your API key in advance and set the `API_KEY` field in the following evaluation scripts accordingly.
```bash
# Run embodiment-oriented evaluation
bash scripts/rbench_eval_4embodiments.sh

# Run task-oriented evaluation
bash scripts/rbench_eval_5tasks.sh
```
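One way to fill in the key without opening an editor is a `sed` pass like the following; it assumes the scripts define a line starting with `API_KEY=`, which may not match their actual layout, so check the scripts first:

```bash
# Sketch only: set the API_KEY field in both evaluation scripts.
# Assumes a line of the form API_KEY=... ; verify against the real scripts.
for s in scripts/rbench_eval_4embodiments.sh scripts/rbench_eval_5tasks.sh; do
  sed -i 's|^API_KEY=.*|API_KEY="your-key-here"|' "$s"
done
```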
ReVidgen: The Future of Video Generation for the Physical World Use Cases
- Compare video models for robot tasks before a lab deployment, and see which one scores higher on your target task set.
- Train next-gen robot video models with RoVid-X data, using depth and flow to teach better motion cues.
- Benchmark new ideas and publish side-by-side charts; the scripts make results repeatable and easy to check.
- Build internal QA for robot data engines, with RBench as an automated quality gate (see the sketch after this list).
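For the QA-gate use case, a thin wrapper could fail a pipeline run when the average score dips below a threshold. The results file name and CSV format below are entirely hypothetical, since the scripts' output format is not documented here:

```bash
# Hypothetical QA gate: fail the run if the average RBench score is too low.
# results/my_model_scores.csv (name,score per row) is invented for illustration.
THRESHOLD=0.55
awk -F, -v t="$THRESHOLD" '
  NR > 1 { sum += $2; n++ }            # skip header, accumulate scores
  END {
    avg = (n > 0) ? sum / n : 0
    printf "average score: %.3f (threshold: %s)\n", avg, t
    exit (avg >= t + 0) ? 0 : 1        # nonzero exit fails the gate
  }
' results/my_model_scores.csv || { echo "QA gate failed"; exit 1; }
```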

Performance & Showcases
RBench scores 25 models across five task groups and four robot body types, and its scores track human ratings closely (Spearman 0.96). Top average scores include Wan 2.6 (0.607), Seedance 1.5 Pro (0.584), Wan 2.5 (0.570), Hailuo v2 (0.565), and Veo 3 (0.563).

- Showcase 1 (Wan 2.6 sample): shows how Wan 2.6 handles actions that need clean motion and object control; look for steady movement and clear task progress in the scene.
- Showcase 2 (Wan 2.6 on embodied tasks): Wan 2.6 tested under the "Rethinking Video Generation Model for the Embodied World" theme, judging action quality and scene layout together.
- Showcase 3 (task-wise look at Wan 2.6): task-wise checks that compare Wan 2.6 across different task types; notice how control and timing vary by task.
- Showcase 4 (task-wise view of Kling 2.6 Pro): focus on action stability and how objects move across frames.
- Showcase 5 (task-wise view of Veo 3): highlights task scores and sample behavior for Veo 3; compare with the clips above for a quick sense of strengths.
- Showcase 6 (task-wise view of Seedance 1.5 Pro): checks Seedance 1.5 Pro on several tasks in a row; watch for control of objects and the smoothness of actions.
Data: RoVid-X at a Glance
RoVid-X is a million-scale dataset of robot interactions. It includes RGB videos, depth, and optical flow to teach models about objects and motion. The team plans to release it on Hugging Face after internal review.
This makes the dataset helpful for training models that must understand timing, contact, and cause-and-effect. It can also support transfer to real robot tasks later.
Getting Started Checklist
- Install the environment exactly as listed above.
- Download all checkpoints and the RBench validation set.
- Place your generated videos under `data/{model_name}/{task_name or embodiment_name}/videos/`.
- Run the two evaluation scripts to get scores by embodiment and by task.
FAQs
What makes RBench different from other test sets?
RBench focuses on tasks and robot bodies, not just looks. It tells you how a model handles object use, space, timing, and reasoning, and matches that to the body type involved.
Do I need to train a model to use RBench?
No. You can test an existing model by generating videos and placing them in the folder format shown above. Then run the scripts to get scores.
Can I run the evaluation without an API key?
You can run most parts, but GPT-based checks need an API key set in the script. Prepare your key before you start.
How do I learn the basics of text-to-video before trying this?
If you need a simple primer, read our short explainer on text to video basics. It will help you understand prompts, frames, and timing at a high level.
Where can I read more related work?
For long context ideas in video AI, see our note on long context video. You can also compare with an industry system in our Goku video model overview.
Image source: ReVidgen: The Future of Video Generation for the Physical World