Phantom: Revolutionizing Subject-Consistent Video Generation with Cross-Modal Alignment

What is Phantom: Subject-Consistent Video Generation with Cross-Modal Alignment
Phantom is a video-generation tool that turns a text prompt and a few reference photos into a short, moving clip. It keeps the subject’s face and look the same from start to finish, even across different scenes.

It lines up three things at once: your words, your pictures, and the final video. This “cross‑modal alignment” helps the model understand who the subject is and what should happen in the scene.
Phantom: Subject-Consistent Video Generation with Cross-Modal Alignment Overview
Phantom is built by the Intelligent Creation Team at ByteDance. It works with single or multiple people, and it sits on top of strong text-to-video and image-to-video backbones. If you want a quick intro to how text turns into moving pictures, see our simple overview here: text-to-video basics.
Project Overview
- Type: Subject-to-video generation framework
- Purpose: Make videos that keep the same person’s identity across all frames
- Inputs: Text prompt + 1 to 4 reference images
- Outputs: Short videos with consistent faces and looks
- Models: Phantom-Wan-1.3B and Phantom-Wan-14B (built on Wan2.1 components)
- Best for: Human subjects, multi-subject scenes, prompt-following
- Hardware: Single GPU or multi-GPU (FSDP supported)
- License/Access: Models downloadable via Hugging Face
- Latest Highlights: Multi-subject support, data release (Phantom-Data), ComfyUI adapter
Phantom: Subject-Consistent Video Generation with Cross-Modal Alignment Key Features
- Strong identity lock-in. The person you put in stays the same across frames and shots.
- Single or multi-subject. Use up to four reference images to guide one or more people.
- Text, image, and video alignment. It connects your words, your images, and the result so they match well.
- Two sizes, two speeds. Use the 1.3B model for lighter runs or the 14B model for stronger results.
- Single GPU or scale up. Run on one card, or use FSDP on many cards for faster work.
- Helpful tips built-in. Change seed, match your prompt to the reference images, and try 24 fps for steadier motion.

If you care about longer video planning or scene memory, you may also like this read: handling long-context video.
Phantom: Subject-Consistent Video Generation with Cross-Modal Alignment Use Cases
- Ads and brand videos: Keep the same brand face in many short clips without reshoots.
- Film pre-visualization: Try scene ideas with the same actor’s face before spending on sets.
- Social content: Create character-led shorts that stay true to the person across posts.
- Education and music: Opera, choir, and music demos with a steady lead performer.
For a fun look at another ByteDance model family, see our overview of a character-focused generator here: Goku-style video generation.
Performance & Showcases
Showcase 1 — Ne Zha
This clip shows “Ne Zha” with strong face and outfit consistency while the motion follows the prompt. It highlights how Phantom holds the subject steady even as the scene moves.
Showcase 2 — The Phantom of the Opera
“The Phantom of the Opera” shows stage-ready styling, with the same identity across frames and smooth camera flow. The model keeps the character look while the scene mood shifts.
Showcase 3 — The New Annabelle
“The New Annabelle” highlights character-driven storytelling with a fixed face and outfit. Even with mood changes, the core subject stays the same.
How Phantom Works (Plain Talk)
- You give Phantom a few clear reference images and a text prompt that describes the look and the scene; the sketch after this list shows how those inputs map onto the command line. It then makes a video where the person keeps the same identity.
- Phantom aligns text, image, and video. This means it learns how your words relate to the pictures and then to each frame in the clip.
- It uses a joint text-image injection method so the subject’s features carry through many frames, not just the first one.
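These inputs map directly onto the command line: the reference images go to --ref_image (comma-separated) and the description goes to --prompt. The line below is a minimal single-GPU sketch with placeholder image names; the flags mirror the real commands in the setup guide that follows.
# Placeholder reference images; replace with your own files and describe them in the prompt.
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "my_subject.png,my_prop.png" --prompt "A short description covering the subject, the prop, and the action you want" --base_seed 42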
Installation & Setup (Getting Started)
Follow the steps below as-is. Do not skip any command.
Clone the repo:
git clone https://github.com/Phantom-video/Phantom.git
cd Phantom
Install dependencies:
# Ensure torch >= 2.4.0
pip install -r requirements.txt
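A quick way to confirm the PyTorch requirement is met (a simple sanity check, not part of the official setup):
python -c "import torch; print(torch.__version__)"   # should print 2.4.0 or newer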
Model Download
First, download the original Wan2.1 1.3B model, since the Phantom-Wan models rely on the Wan2.1 VAE and text encoder. Download Wan2.1-T2V-1.3B using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
Then download the Phantom-Wan-1.3B and Phantom-Wan-14B models:
huggingface-cli download bytedance-research/Phantom --local-dir ./Phantom-Wan-Models
Alternatively, you can manually download the required models and place them in the Phantom-Wan-Models folder.
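If the downloads succeed, the working directory should look roughly like the layout below. This is an assumption based on the paths used in the commands in this guide; the exact file names inside each folder may differ.
Phantom/
├── Wan2.1-T2V-1.3B/          # Wan2.1 VAE and text encoder checkpoints
└── Phantom-Wan-Models/
    ├── Phantom-Wan-1.3B.pth  # used by the 1.3B commands below
    └── (14B weights)         # the 14B commands point at this folder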
Run Subject-to-Video Generation
Phantom-Wan-1.3B
- Single-GPU inference
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "Warm sunlight spills across the grass as a little girl with twin ponytails, a green bow in her hair, and a light-green dress crouches beside blooming daisies. Next to her, a brown-and-white dog pants with its tongue out, its fluffy tail wagging happily. Smiling, the girl raises a yellow-and-red toy camera with blue buttons to capture this joyful moment with the dog." --base_seed 42
- Multi-GPU inference using FSDP + xDiT USP
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "At sunset, a woman with wheat-toned skin and long jet-black hair puts on a red gauze dress decorated with large sculpted flowers and flowing ribbons at the shoulders, and strolls along a golden beach as the sea breeze brushes her hair; the scene is beautiful and moving." --base_seed 42
Notes
- Changing --ref_image switches between single-reference and multi-reference subject-to-video generation; two variants are sketched after these notes. Use at most four reference images.
- For the best results, describe the content of each reference image as accurately as possible in --prompt. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
- If the generated video is unsatisfactory, the most straightforward fix is to change --base_seed and adjust the description in --prompt.
- For more inference examples, please refer to "infer.sh".
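To illustrate the first note, here are single-reference and multi-reference variants of the same run; only --ref_image changes, and the prompt should describe whichever references you pass (the prompts below are placeholders):
# Single reference
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png" --prompt "describe ref1 and the action here" --base_seed 42
# Two references
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "describe ref1 and ref2 and the action here" --base_seed 42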
Phantom-Wan-14B
- Single-GPU inference
python generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref12.png,examples/ref13.png" --prompt "Ne Zha, with twin hair buns, dressed in red-and-black clothes with flame motifs, a gold torque at his neck and gold bracers on his arms, sits side by side with Ao Bing, who has pale-blue hair, a blue mark on his forehead, and a white robe, at a classroom desk as they intently discuss the contents of a book. The background of soft lighting and leaves stirring in the breeze outside the window creates a quiet yet lively study atmosphere."
- Multi-GPU inference using FSDP + xDiT USP
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref14.png,examples/ref15.png,examples/ref16.png" --dit_fsdp --t5_fsdp --ulysses_size 8 --ring_size 1 --prompt "A cartoon grandpa wearing a yellow hat and a yellow top with brown suspenders lifts a steaming blue coffee cup inside a fresh, cartoon-style café decorated with pink and blue tables and chairs, colorful hanging lamps, and colorful ball ornaments; the overall style is cartoonish and fresh."
Notes
- The currently released Phantom-Wan-14B model was trained on 480P data but can also be applied to generating videos at 720P and higher resolutions, though the results may be less stable. We plan to release a version further trained on 720P data in the future.
- The Phantom-Wan-14B model was trained on 24fps data, but it can also generate 16fps videos, similar to the native Wan2.1. However, the quality may experience a slight decline.
- It is recommended to generate horizontal videos, as they tend to produce more stable results compared to vertical videos.
- For more inference examples, please refer to "infer.sh".
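In both multi-GPU commands above, --ulysses_size × --ring_size equals --nproc_per_node (4 × 2 = 8 and 8 × 1 = 8). If you adapt the launch to a different GPU count, keeping that product equal to the number of processes appears to be the pattern to follow. A hypothetical, untested 4-GPU variant of the 14B run might look like this:
torchrun --nproc_per_node=4 generate.py --task s2v-14B --size 832*480 --frame_num 121 --sample_fps 24 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models --ref_image "examples/ref12.png,examples/ref13.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 1 --prompt "describe the references and the scene here"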
The Tech Behind It (In Simple Words)
- Built on Wan2.1 parts. Phantom uses Wan2.1’s VAE and text encoder, then adds its own subject module.
- Text-image-video triplets. It learns from sets where a text and an image match a video, which helps it stay true to both your prompt and your subject.
- Scales with hardware. You can run on one GPU for tests or many GPUs for longer, higher frame clips.
If you are new to prompt writing for video, this quick primer can help: how text becomes video.
Tips for Better Results
- Use 1–4 clean face images. Front and side views help.
- Write prompts that describe your reference images. Mention key items like hair, clothes, and colors.
- Try a few base seeds. If the clip is not great, change --base_seed and adjust the prompt; a small seed-sweep loop is sketched below.
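Here is a minimal seed-sweep loop for the last tip, assuming the 1.3B single-GPU setup from the installation section; the reference images and prompt are placeholders to replace with your own:
for seed in 11 42 123 2024; do
  python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "describe the references and the action here" --base_seed $seed
done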
FAQ
Do I need the Wan2.1 model files?
Yes. Phantom-Wan depends on Wan2.1’s VAE and text encoder, so you must download Wan2.1-T2V-1.3B before running Phantom.
How many reference images can I use?
You can use up to four. More variety can help the model keep the same face from different angles.
Can I make 720p videos?
Yes, but the 14B model is trained on 480p. You can still render 720p or higher, but results may be less stable.
What frame rates work best?
24 fps is the target for the 14B model. It can do 16 fps too, but quality may drop a bit.
Does it support multi-GPU?
Yes. You can use FSDP and the provided torchrun commands to split work across many GPUs.
Image source: https://phantom-video.github.io/Phantom/