Supercharging GPU Performance: How CUDA Agent Automates High-Speed Kernel Generation with RL

CUDA Agent is a research project that uses reinforcement learning (RL) to train a model that writes and optimizes GPU code on its own. Put simply, it learns to produce faster CUDA kernels, which in turn make deep learning workloads run faster.

It targets real speed, not just passing tests. A kernel counts as a success only if it is correct and at least 5% faster than torch.compile, PyTorch's built-in compiler and a strong baseline.

Overview

Here is a quick summary of what the project is and why it matters.

Item	Details
Type	RL-driven system that writes and tunes CUDA kernels
Purpose	Make PyTorch workloads faster by generating high-speed custom kernels
Who It’s For	AI researchers, ML engineers, CUDA learners, and performance teams
Main Features	Data synthesis at scale, skill-aware coding loop, strict checks, stable long-context RL training, agent workspace (agent_workdir)
Dataset	CUDA-Agent-Ops-6K (6,000 training tasks)
Agent Workspace	SKILL.md, model.py/model_new.py, kernels/, bindings, compile and verification tools
Speed Results (Overall)	96.8% faster rate vs. torch.compile and 2.11x geomean speed-up
Speed Results (Level-3)	90% faster rate vs. torch.compile and 1.52x geomean speed-up
Latest News	2026.02.27: agent workdir added to GitHub; dataset released on Hugging Face
Project Site	https://cuda-agent.github.io/
GitHub	https://github.com/BytedTsinghua-SIA/CUDA-Agent

ByteDance Seed

Key Features

Agentic RL for kernel coding: The model writes CUDA, compiles it, checks it, profiles it, and improves it over many steps.
High-quality data: A 6,000-sample training set (CUDA-Agent-Ops-6K) built from real PyTorch and Transformers ops, with strict filters.
Skill-aware environment: Clear rules (SKILL.md), protected tools, and fair rewards that push real speed gains.
Long-context training: Stable multi-stage RL with warm-up, smart filtering, and value pretraining for reliable learning.
Full agent workspace: A ready folder (agent_workdir) to build, verify, profile, and iterate on CUDA extensions.

Figure: CUDA Agent environment loop

Use Cases

Speed up custom model parts that torch.compile does not push far enough.
Reduce training or inference time for heavy PyTorch workloads.
Teach students and new engineers how real GPU optimization works in practice.
Compare kernel quality across tasks with a strong, repeatable setup.

Performance & Showcases

CUDA Agent reports strong results on KernelBench. Across the full benchmark, it beats torch.compile on 96.8% of tasks (the "faster rate") and reaches a 2.11x geometric-mean speed-up.

Figure: KernelBench benchmark chart for CUDA Agent

On the hardest Level-3 set, it still reaches a 90% faster rate vs. torch.compile and a 1.52x geomean speed-up. The authors report that this is a wide margin over the strong proprietary models they tested.

Figure: Main experimental results on KernelBench

How CUDA Agent Works

Start from a PyTorch baseline and measure it.
Write CUDA kernels and C++ bindings, then compile in a GPU sandbox.
Run strict correctness checks and profile speed. Repeat until the kernel is both correct and at least 5% faster than torch.compile.

The reward system gives points for milestones like passing checks and beating speed targets. Anti-cheat rules block shortcuts, so gains reflect real kernel quality.
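
The milestone idea can be sketched in a few lines of Python. This is only an illustration: the function name milestone_reward, the point values, and the min_gain parameter are assumptions made up for this sketch, not the project's actual reward function. Only the 5% threshold over torch.compile comes from the description above.

```python
def milestone_reward(compiled: bool, correct: bool,
                     kernel_ms: float, compile_ms: float,
                     min_gain: float = 0.05) -> int:
    """Toy milestone reward; the point values are illustrative assumptions."""
    if not compiled:
        return -1  # the build failed, so there is nothing to measure
    if not correct:
        return 0   # the kernel runs, but its outputs are wrong
    # "At least 5% faster" is read here as speed-up >= 1.05,
    # which is one possible reading of the threshold.
    speedup = compile_ms / kernel_ms
    if speedup >= 1.0 + min_gain:
        return 2   # correct and clearly faster than torch.compile
    return 1       # correct, but not fast enough yet
```

With these made-up values, a correct kernel that takes 1.0 ms where torch.compile takes 2.0 ms earns the top score, while a correct kernel that is only 2% faster earns the middle score.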

The Technology Behind It

Data Synthesis at Scale

Training tasks come from a three-step process: collect seed ops, compose them into fused tasks, and filter by strict rules. Only tasks that run successfully in both eager and torch.compile modes, produce deterministic outputs, and have a runtime within a reasonable window make the cut.
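
The filter step can be pictured as a simple yes/no check per task. The sketch below is hypothetical: the TaskStats fields, the keep_task name, and the min_ms and max_ms bounds are placeholders for illustration, since the exact thresholds are not given here.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Measurements for one candidate task (field names are assumptions)."""
    runs_in_eager: bool     # the task executes in PyTorch eager mode
    runs_in_compile: bool   # the task executes under torch.compile
    deterministic: bool     # repeated runs give the same output
    eager_ms: float         # measured eager runtime in milliseconds


def keep_task(t: TaskStats, min_ms: float = 1.0, max_ms: float = 100.0) -> bool:
    """Keep a task only if it passes every filter.

    min_ms and max_ms are placeholder bounds, not the project's real window.
    """
    return (
        t.runs_in_eager
        and t.runs_in_compile
        and t.deterministic
        and min_ms <= t.eager_ms <= max_ms
    )
```

A task that fails any single check is dropped, which is why the final dataset is much smaller than the raw pool of composed tasks.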

Figure: CUDA Agent data synthesis pipeline

Agent Environment and Rewards

The agent follows a ReAct-style loop with coding tools and a CUDA skill spec (SKILL.md). It compiles, debugs, and profiles, while reward controls prevent bad shortcuts like constant outputs or hidden fallbacks.
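
To see why a constant-output shortcut is easy to catch, here is a small pure-Python sketch. It is not the project's checker: looks_constant, make_input, and the two toy "kernels" are hypothetical names for illustration. The idea is simply to feed several different inputs and see whether the output ever changes.

```python
import random


def looks_constant(fn, make_input, trials: int = 4, seed: int = 0) -> bool:
    """Return True if fn gives the same output for several different inputs."""
    rng = random.Random(seed)
    outputs = [fn(make_input(rng)) for _ in range(trials)]
    return all(out == outputs[0] for out in outputs[1:])


def make_input(rng):
    # Eight random floats stand in for a real input tensor.
    return [rng.random() for _ in range(8)]


def honest_kernel(xs):
    return [2.0 * x for x in xs]   # the output depends on the input


def cheating_kernel(xs):
    return [0.0] * len(xs)         # ignores the input entirely


print(looks_constant(honest_kernel, make_input))    # False
print(looks_constant(cheating_kernel, make_input))  # True
```

A check like this is only a first screen, because some legitimate ops really do return constants. A full verification step would also compare outputs against the PyTorch reference on each input.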

Stable Long-Context Training

Training uses stages for stability: a single-turn PPO warm-up, Rejection Fine-Tuning (RFT) for the actor, and value pretraining for the critic. This keeps learning steady even with long contexts and many turns.
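
The rejection step in RFT can be sketched as a filter over sampled trajectories: keep the good ones and fine-tune on those. The Trajectory class, the select_for_rft name, and the min_reward cut-off below are assumptions for illustration, not the authors' code or settings.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One sampled agent run (fields are assumptions for this sketch)."""
    task_id: str
    reward: float
    turns: list = field(default_factory=list)  # the agent's messages and tool calls


def select_for_rft(trajectories, min_reward: float = 1.0):
    """Keep only trajectories whose final reward reaches min_reward."""
    return [t for t in trajectories if t.reward >= min_reward]


samples = [
    Trajectory("task-a", reward=2.0),
    Trajectory("task-a", reward=0.0),
    Trajectory("task-b", reward=1.0),
]
kept = select_for_rft(samples)
print([t.task_id for t in kept])  # ['task-a', 'task-b']
```

Training the actor only on the kept runs gives it a cleaner starting point before the full multi-turn RL stage begins.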

Figure: CUDA Agent training stages

Who Is Behind CUDA Agent

CUDA Agent is a joint effort from researchers at ByteDance Seed and the Institute for AI Industry Research, Tsinghua University, working on high-performance deep learning. The project site shares news, results, and links to data and code.

Installation & Setup (agent_workdir)

The repository includes a ready-to-use agent workspace called agent_workdir. It shows the full loop: write kernels, compile, verify, profile, and iterate.

Key files inside agent_workdir:

SKILL.md: workflow constraints and optimization rules for agent execution
model.py: original PyTorch baseline model
model_new.py: optimized model using the custom CUDA extension
binding.cpp / binding_registry.h: shared Python binding registration infrastructure
kernels/: custom CUDA/C++ kernels and their bindings
utils/compile.py + utils/compile.sh: extension build scripts
utils/verification.py: correctness validation script
utils/profiling.py: performance comparison against baseline and torch.compile

Common commands (run inside agent_workdir):

bash utils/compile.sh
python3 -m utils.verification
python3 -m utils.profiling

Step-by-step guide:

Open a terminal and switch to agent_workdir.
Run the compile script to build the CUDA extension.
Verify correctness.
Profile speed against the baseline and torch.compile.
Edit kernels and repeat to chase more speed.
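
If you repeat these steps often, a small Python wrapper can run them in order and stop at the first failure. This is a convenience sketch, not part of the repository. The three commands are the ones listed above, while the run_pipeline name and the assumption that each script signals failure with a non-zero exit code are ours.

```python
import subprocess

# The three commands from the list above, in the order they should run.
STEPS = [
    ["bash", "utils/compile.sh"],
    ["python3", "-m", "utils.verification"],
    ["python3", "-m", "utils.profiling"],
]


def run_pipeline(workdir: str = "agent_workdir", run=subprocess.run) -> bool:
    """Run compile, verification, and profiling; stop at the first failure.

    Assumes each script exits with a non-zero code on failure, which is
    not confirmed by the project description. The run parameter exists
    only so the function can be tried without a GPU.
    """
    for cmd in STEPS:
        result = run(cmd, cwd=workdir)
        if result.returncode != 0:
            print("step failed:", " ".join(cmd))
            return False
    return True
```

Call run_pipeline() from the directory that contains agent_workdir, or pass the path explicitly, after each kernel edit to get a quick pass or fail signal.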

FAQ

What is a CUDA kernel?

A CUDA kernel is a small function that runs on an NVIDIA GPU, executed by many threads in parallel. Kernels handle the heavy math behind deep learning, so faster kernels mean faster training and inference.

What does “faster rate vs. torch.compile” mean?

It is the share of tasks where the agent’s kernel runs faster than torch.compile. Higher is better, since it shows more wins across many tasks.
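
Here is a tiny worked example of both headline metrics. The speed-up numbers are made up for illustration, and the sketch assumes a speed-up is the torch.compile time divided by the agent kernel's time, with "faster" meaning a speed-up above 1.0.

```python
import math


def faster_rate(speedups):
    """Share of tasks where the agent's kernel beats the baseline."""
    return sum(s > 1.0 for s in speedups) / len(speedups)


def geomean(speedups):
    """Geometric mean: the n-th root of the product of n speed-ups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Made-up speed-ups for four tasks (baseline time / agent kernel time).
speedups = [2.0, 0.5, 4.0, 1.0]

print(faster_rate(speedups))         # 0.5
print(round(geomean(speedups), 2))   # 1.41
```

The geometric mean is the usual choice for averaging ratios, because one very large speed-up cannot dominate the result the way it would in a plain average.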

What is CUDA-Agent-Ops-6K?

It is a training dataset with 6,000 tasks. Each task is built from real PyTorch and Transformers ops and must pass strict filters, which keeps the data clean and useful for RL.

Do I need to change my whole model to try this?

No. The agent_workdir shows how to add a custom extension to speed up just the slow parts. You can keep the rest of your PyTorch code as is.

How does the project prevent fake speed-ups?

It blocks bad patterns like constant outputs and hidden fallbacks. It uses protected scripts and multiple inputs to make sure the gains are real.
