Mastering High-Resolution Video Matting with Robust Temporal Guidance

What is Robust High-Resolution Video Matting with Temporal Guidance
This project separates people from the background in videos while preserving fine details like hair and semi-transparent edges. It works fast, even on 4K video, and it does not need extra inputs like a trimap or a pre-captured clean background photo.
It was built by researchers at ByteDance Inc. and the University of Washington. It runs in real time and keeps motion smooth from frame to frame, so results look steady and clean.
Robust High-Resolution Video Matting with Temporal Guidance Overview
Here is a quick look at the project in simple terms.
| Item | Details |
|---|---|
| Type | AI model for human video matting |
| Purpose | Cut out a person from any video and keep hair, edges, and motion looking natural |
| Key Strength | Real-time 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080 Ti |
| Works On | Any video input, no trimap or pre-captured background needed |
| Core Idea | Uses memory across frames (temporal guidance) for stable, smooth results |
| Recommended Model | MobileNetV3 (best for most users); ResNet50 (bigger model, small gains) |
| Outputs | Full composite video, alpha matte (transparency), and foreground |
| Available Formats | PyTorch, TorchScript, ONNX, TensorFlow, TensorFlow.js |
| Demos | Web demo (browser), Colab demo (free GPU) |
| API Options | Simple conversion API, TorchHub model and converter |
| Team | Shanchuan Lin, Linjie Yang, Imran Saleemi, Soumyadip Sengupta |
| From | ByteDance Inc. and University of Washington |
Robust High-Resolution Video Matting with Temporal Guidance Key Features
- Real-time speed: 4K at 76 FPS, HD at 104 FPS on a GTX 1080 Ti.
- No extra inputs: Works without a trimap or a clean background image.
- Smooth motion: Uses memory across frames to keep edges steady over time.
- Multiple outputs: Get composite video, alpha matte, and foreground layers.
- Easy tools: Web demo, Colab demo, and a simple converter API.
- Broad support: PyTorch, TorchScript, ONNX, TensorFlow, and TensorFlow.js.
Robust High-Resolution Video Matting with Temporal Guidance Use Cases
- Virtual backgrounds for video calls and live streams.
- Green-screen style editing without a real green screen.
- Post-production for YouTube, ads, and short-form content.
- Mobile and desktop video apps that need fast, clean people cutouts.
- Game engines and AR effects that need stable human mattes.
How It Works (Plain English)
The model looks at each video frame and also remembers what it saw just before. This memory helps it keep hair and edges steady and avoid flicker.
It learns two jobs at once: fine matting and simple segmentation. Training both together makes the model more stable and less likely to break on hard scenes.
You can feed in any video. The model gives you three things: the person only (foreground), a transparency map (alpha), and a final composite if you want to replace the background.
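To make the relationship between those three outputs concrete, here is a minimal compositing sketch; the tensor names (fgr, pha) follow the repo's convention, but the shapes and values are placeholders.
import torch
fgr = torch.rand(1, 3, 720, 1280)      # Predicted person colors (batch, channels, height, width).
pha = torch.rand(1, 1, 720, 1280)      # Predicted transparency: 1 = person, 0 = background.
new_bg = torch.zeros(1, 3, 720, 1280)  # Any replacement background (black here).
com = fgr * pha + new_bg * (1 - pha)   # Per-pixel blend gives the final composite frame.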
If you want to learn more about AI video creation basics, check out our text-to-video guide.
Installation & Setup (Getting Started)
Follow these steps to run the model and process videos. Use the exact commands and code as shown.
1) Install dependencies
pip install -r requirements_inference.txt
2) Load the model (PyTorch)
import torch
from model import MattingNetwork
model = MattingNetwork('mobilenetv3').eval().cuda() # or "resnet50"
model.load_state_dict(torch.load('rvm_mobilenetv3.pth'))
Tip: MobileNetV3 is the suggested choice for most users. ResNet50 is larger, with only small gains.
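If you do not have a CUDA GPU, a common PyTorch pattern (not from the official docs, but the converter accepts a model on any device) is to fall back to CPU:
import torch
from model import MattingNetwork
device = 'cuda' if torch.cuda.is_available() else 'cpu'   # Use the GPU when present, otherwise CPU.
model = MattingNetwork('mobilenetv3').eval().to(device)
model.load_state_dict(torch.load('rvm_mobilenetv3.pth', map_location=device))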
3) Convert a video with the simple API
from inference import convert_video
convert_video(
    model,                           # The model, can be on any device (cpu or cuda).
    input_source='input.mp4',        # A video file or an image sequence directory.
    output_type='video',             # Choose "video" or "png_sequence".
    output_composition='com.mp4',    # File path if video; directory path if png sequence.
    output_alpha="pha.mp4",          # [Optional] Output the raw alpha prediction.
    output_foreground="fgr.mp4",     # [Optional] Output the raw foreground prediction.
    output_video_mbps=4,             # Output video mbps. Not needed for png sequence.
    downsample_ratio=None,           # A hyperparameter to adjust or use None for auto.
    seq_chunk=12,                    # Process n frames at once for better parallelism.
)
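If you plan to edit frames in another tool, the same call can write a PNG sequence instead of video files. This variant is a sketch based on the parameter comments above; the directory paths are placeholders.
convert_video(
    model,
    input_source='input.mp4',        # Same input video as before.
    output_type='png_sequence',      # Directories of numbered frames instead of video files.
    output_composition='com/',       # Composite frames (placeholder directory).
    output_alpha='pha/',             # [Optional] Alpha frames (placeholder directory).
    output_foreground='fgr/',        # [Optional] Foreground frames (placeholder directory).
    seq_chunk=12,                    # Process n frames at once for better parallelism.
)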
4) Write your own inference loop
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
from inference_utils import VideoReader, VideoWriter
reader = VideoReader('input.mp4', transform=ToTensor())
writer = VideoWriter('output.mp4', frame_rate=30)
bgr = torch.tensor([.47, 1, .6]).view(3, 1, 1).cuda() # Green background.
rec = [None] * 4 # Initial recurrent states.
downsample_ratio = 0.25 # Adjust based on your video.
with torch.no_grad():
    for src in DataLoader(reader):                     # RGB tensor normalized to 0 ~ 1.
        fgr, pha, *rec = model(src.cuda(), *rec, downsample_ratio)  # Cycle the recurrent states.
        com = fgr * pha + bgr * (1 - pha)              # Composite to green background.
        writer.write(com)                              # Write frame.
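Passing downsample_ratio=None to the converter lets it pick a value automatically. If you set it yourself in a custom loop like the one above, a reasonable starting heuristic (an approximation, not necessarily the repo's exact rule) is to scale the longer side to roughly 512 px:
def pick_downsample_ratio(height, width, target=512):
    # Heuristic sketch: downsample so the longer side lands near `target` pixels,
    # but never upsample (the ratio is capped at 1.0).
    return min(target / max(height, width), 1.0)

ratio = pick_downsample_ratio(2160, 3840)   # A 4K frame gives about 0.13, close to the suggested 0.125.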
5) TorchHub option (quick load + converter)
# Load the model.
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3") # or "resnet50"
# Converter API.
convert_video = torch.hub.load("PeterL1n/RobustVideoMatting", "converter")
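Putting the two TorchHub entry points together, a minimal end-to-end run could look like the sketch below; the file names are placeholders.
import torch
model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3").eval().cuda()
convert_video = torch.hub.load("PeterL1n/RobustVideoMatting", "converter")
convert_video(model, input_source='input.mp4', output_composition='com.mp4', seq_chunk=12)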
Notes:
- Models are available across PyTorch, TorchScript, ONNX, TensorFlow, and TensorFlow.js (see the loading sketch after these notes).
- Download weights from the official model links (e.g., Google Drive).
- For help with the downsample_ratio and more options, see the inference docs in the repo.
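For the non-PyTorch formats mentioned above, each export loads with its own runtime. As one example, a TorchScript export bundles the network architecture, so the repo's Python source is not required; the file name below is a placeholder for whichever exported weights you download.
import torch
model = torch.jit.load('rvm_mobilenetv3_fp32.torchscript').eval().cuda()   # Placeholder file name.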
Performance & Showcases
The team measured speed with a reference script. HD used downsample_ratio=0.25 and 4K used 0.125. Older GPUs like GTX 1080 Ti use FP32; newer ones can use FP16 for more speed.
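On newer GPUs with good FP16 support, the usual way to get that extra speed is standard PyTorch half precision. The snippet below is a general sketch, not the team's benchmark script, and assumes the model was loaded as shown in the setup section.
import torch
model = model.half().cuda()                         # Cast the already-loaded model's weights to FP16.
src = torch.rand(1, 3, 2160, 3840).half().cuda()    # Dummy 4K frame; replace with real frames.
rec = [None] * 4                                    # Fresh recurrent states.
with torch.no_grad():
    fgr, pha, *rec = model(src, *rec, 0.125)        # 4K uses downsample_ratio=0.125.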
Showcase 1: A demo clip highlighting human matting at high resolution, with steady motion and fine hair detail (embedded as a YouTube video in the original post).
Showcase 2: A second sample showing stable matting on moving subjects (also embedded as a YouTube video).
Tips for Best Results
- Pick a good downsample_ratio. If the scene is very detailed or 4K, start at 0.125. For HD, 0.25 often works well.
- Keep the subject well lit. Clear edges help the alpha matte.
- Use the output_alpha and output_foreground files if you plan to edit later in your video tool.
Tools and Demos
You can try a live webcam demo in your browser and see how the model tracks motion. There is also a Colab demo so you can test on your own videos with a free GPU.
There are third-party projects too, such as Android and Unity demos. If ByteDance work interests you, see this short read on Goku video generation.
The Technology Behind It
The model uses a memory across frames to keep track of the background and past motion. This helps keep edges stable and reduces flicker.
Training includes both fine matting and a simpler segmentation goal. This mix makes the model steady even in hard scenes, like fast moves or noisy backgrounds.
FAQ
Do I need a green screen?
No. The model does not need a trimap or a clean background. It can work on any normal video.
What output files can I get?
You can save a full composite, the transparency map (alpha), and the isolated person (foreground). These help with later edits.
Which model should I pick?
Use MobileNetV3 for a good balance of speed and quality. ResNet50 is larger and gives only small gains.
Can I run this in real time?
Yes, on a strong GPU. The team reports 4K at 76 FPS and HD at 104 FPS on a GTX 1080 Ti.
Is there a simple way to try it?
Yes. Use the web demo or the Colab demo. For code, use the TorchHub loader and the converter API.
Image source: Mastering High-Resolution Video Matting with Robust Temporal Guidance