What is How NUMINA Uses Human Embodiment Data to Improve Robot Manipulation

NUMINA is a simple add on that helps text to video models follow numbers in your prompt. If you ask for three cats, it works to show three cats instead of two or four. It does this without retraining the base model.

$How NUMINA Uses Human Embodiment Data to Improve Robot Manipulation$

NUMINA reads the model’s attention maps to find where counting errors start. It then builds a simple layout of where each object should be and guides the model to match the exact count during a second pass. The team shows clear gains on their CountBench tests and across several Wan 2.1 model sizes.

NUMINA Overview

Here is a quick look at the project details.

Item	Details
Type	Training free add on for text to video diffusion models
Purpose	Make the number of objects in the video match the numbers in the prompt
Main features	Finds counting errors from attention maps, builds a countable layout, guides a second pass to correct counts, no retraining, works with EasyCache, simple command line use
Works with	Wan 2.1 text to video models at 1.3B, 5B, and 14B sizes
Input	A text prompt with clear numerals and nouns, for example “Three men are walking in the park.”
Output	A video with the requested number of objects while keeping the scene and motion
Accuracy gains	Up to 7.4 percent better counting on Wan 2.1 1.3B in CountBench tests
Install steps	Clone Wan2.1 and NUMINA, copy module files, pip install requirements
Quick start	One line command with —numina and noun counts in JSON
Code	GitHub repository and simple folder add on inside Wan 2.1
Demo	Side by side videos that compare baseline and NUMINA
Best for	Prompts that need strict counts of people, animals, and items
Website	Project page with demos and notes
Status	Research code from the H EmbodVis team
License	Not specified in the provided notes

NUMINA Key Features

Training free add on that plugs into Wan 2.1. No model fine tune is needed.
Two phase flow. First find a layout with the right count, then guide the second pass to match it.
Attention guided control. It looks at both cross attention and self attention at chosen layers to spot count gaps.
Countable layout. It uses a two stage clustering method to locate object slots you can count.
Works with speed up tools. It supports EasyCache during the first phase to save time.
Cross model use. Shown on Wan 2.1 at 1.3B, 5B, and 14B.
Simple CLI. You can pass noun to count pairs.

NUMINA Use Cases

Ads and media teams that need a fixed number of people or props in a clip.
Education clips where prompts like “Four planets orbit the star” must be exact.
Dataset creation where strict counts matter for later use.

If you are exploring real world task control, you may also like our short guide on robot skills here: Gr1 Manipulation.

How it works

NUMINA runs in two steps. First, it scans attention maps to see how the model spreads focus across words and frames. It picks the heads that best separate the target noun and clusters those maps into a clean, countable layout.

Second, it refines the layout a bit to avoid splits or merges. Then it guides cross attention during a new generation pass so that the video shows the right number of items in the right places.

$NUMINA method overview$

The technology behind it

Attention head selection: it scores attention heads that best track the target noun tokens.
Layout building: it applies MeanShift and DBSCAN in a simple two step flow to get clean, countable regions.
Attention modulation: it adds small biases in cross attention during the second pass so the model follows the layout.
EasyCache support: phase one can reuse transformer outputs to speed up inference.

If you care about data quality topics tied to model outputs, see this short read: Phantom Data.

Installation and setup

Follow these steps exactly as given in the project.

# Clone Wan2.1 repo
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1

# From within the Wan2.1 root directory
git clone https://github.com/H-EmbodVis/NUMINA.git numina_repo
# Copy NUMINA modules
cp -r numina_repo/numina ./numina
# Apply modifications to Wan2.1 files
cp numina_repo/wan/modules/attention.py ./wan/modules/attention.py
cp numina_repo/wan/modules/model.py ./wan/modules/model.py
cp numina_repo/wan/text2video.py ./wan/text2video.py
cp numina_repo/generate.py ./generate.py

# install dependencies
pip install -r numina_repo/requirements.txt

Please follow the Wan2.1 README for model checkpoint downloads and any platform specific setup for example FlashAttention.

Project structure

Here is how the added files fit into Wan 2.1.

Wan2.1/
├── ...
├── numina/ # NUMINA modules (new)
├── __init__.py
├── config.py # All hyperparameters
├── token_mapper.py # Nouns → T5 token index mapping
├── head_selection.py # attention head scoring
├── layout.py # MeanShift + DBSCAN layout pipeline
└── modulation.py # Cross-attention bias for SDPA
├── wan/
├── ...
├── modules/
│ ├── ...
│ ├── attention.py # Modified: extraction + modulation paths
│ └── model.py # Modified: NUMINA state propagation
└── text2video.py # Modified: two-phase pipeline
└── generate.py # Modified: --numina CLI arguments

Modified Wan 2.1 files in four places.

wan/modules/attention.py adds manual attention for extraction and SDPA for modulation
wan/modules/model.py adds NUMINA state management and routing
wan/text2video.py adds generate_numina two phase pipeline with EasyCache integrated
generate.py adds NUMINA CLI arguments

Getting started

Here is the quickest way to try it.

python generate.py \
 --task t2v-1.3B \
 --ckpt_dir /path/to/Wan2.1-T2V-1.3B \
 --prompt "Three men are walking in the park." \
 --numina \
 --numina_noun_counts '{"men": 3}' \
 --size 832*480

NUMINA specific arguments exist for offload options, T5 on CPU, guide scale, seeds, and more. Please refer to the Wan 2.1 docs for the full list.

Tips for better prompts

Keep the noun clear and countable. Use simple nouns like men, cats, cars, or books. Add action and place words after the count part.

For more on companies building across media and AI, here is a light read: Bytedance.

Performance and showcases

Below are short side by side clips that compare outputs.

Showcase 1 — Wan2.1-1.3B(Baseline) This clip shows Wan2.1-1.3B(Baseline) for Case 1. Use it to see the original count behavior before NUMINA.

Showcase 2 — NUMINA This clip shows NUMINA for Case 1 using the same prompt. Check how the object count matches the number in the text.

Showcase 3 — Wan2.1-1.3B(Baseline) This clip shows Wan2.1-1.3B(Baseline) for Case 2. You can compare counts against the label that the text requests.

Showcase 4 — NUMINA This clip shows NUMINA for Case 2 using the same prompt. Look for tighter match between the prompt number and the final video.

Showcase 5 — Wan2.1-1.3B(Baseline) This clip shows Wan2.1-1.3B(Baseline) in a further Case 1 from the addition set. Note any extra or missing objects.

Showcase 6 — NUMINA This clip shows NUMINA for the same Case 1. See how the count aligns with the requested number.

FAQ

Does NUMINA change my base model or retrain it

No. It adds a light module and a two step process. It does not retrain or fine tune the base model.

What prompts work best

Prompts with clear numerals and nouns work best. For example “Four students reading books under the tree” or “Two kittens playing with two yarn balls.”

Can NUMINA work with larger Wan 2.1 models

Yes. The team shows gains on 5B and 14B sizes too. The same flow applies.

What if my noun is not in the token map

Use a simple noun or a close synonym that maps well to T5 tokens. Then set the noun to count pair in the command line.

How to read the demos on the project page

Look at the label and the noun in the prompt first. Then check if the number of objects matches in the video.

If you are comparing outputs across tasks in your pipeline, our short note on data issues may help: Phantom Data.

Image source: How NUMINA Uses Human Embodiment Data to Improve Robot Manipulation