DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

What is DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

DAPO is a training method for large language models that helps them reason better with fewer training steps. It comes from ByteDance Seed and Tsinghua AIR, and it is fully open-source so anyone can try it and learn from it.

At its core, DAPO improves how models learn from trial and error. It keeps training stable, reduces noisy signals, and helps the model explore better steps while solving math and logic tasks. The team shows strong results on the AIME 2024 benchmark using the Qwen2.5-32B base model.

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization Overview

DAPO stands for Decoupled Clip and Dynamic sAmpling Policy Optimization. It is built on top of the open-source verl framework, and it includes code, dataset, training recipes, and evaluation scripts. You can run the released model (DAPO-Qwen-32B), test it on AIME 2024, and review full training logs.

Type: Open-source RL system for large language models
Purpose: Train models to reason better with stable rewards and fewer training failures
Main Features: Clip-Higher, Dynamic Sampling, Token-level Policy Gradient Loss, Overlong Reward Shaping
Model Release: DAPO-Qwen-32B (based on Qwen2.5-32B)
Key Result: 50 points on AIME 2024 with the full DAPO recipe
Code Base: Built on the verl framework
Datasets: Training: DAPO-Math-17k; Validation: AIME 2024
Who Made It: ByteDance Seed and Tsinghua AIR
Good For: Long reasoning, math problems, step-by-step chain-of-thought

For a broader view on ByteDance work in AI, you can skim our short write-up here: ByteDance background.

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization Key Features

  • Clip-Higher: This widens the “clip” range in the learning rule so the model explores more and avoids a collapse in entropy. In short, it prevents the model from getting stuck in one narrow pattern (see the objective sketch after this list).

  • Dynamic Sampling: During training, DAPO filters out prompt groups that are always right or always wrong (accuracy 1 or 0). This keeps useful signals in each batch, which speeds up learning and stabilizes updates.

  • Token-level Policy Gradient Loss: The model learns from signals at the level of individual tokens, which matters for long step-by-step answers. This is key for long chain-of-thought tasks.

  • Overlong Reward Shaping: This reduces reward noise from very long outputs. The result is smoother training and more stable gains.
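For readers who want the math, here is a lightly simplified version of the objective from the DAPO paper (notation paraphrased, not copied verbatim). Clip-Higher is the decoupled range [1 - ε_low, 1 + ε_high], Dynamic Sampling is the constraint that keeps only mixed-accuracy groups, and the normalizer in front is the token-level average:

\[
J_{\text{DAPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \operatorname{clip}\!\big(r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\, \hat{A}_{i,t} \Big) \right]
\quad \text{s.t.}\quad 0 < \#\{\,o_i : o_i \text{ is correct}\,\} < G,
\]

where the expectation is over prompts with G sampled answers each, r_{i,t}(θ) is the per-token importance ratio between the new and old policy, and Â_{i,t} is the group-normalized advantage (the answer's reward, standardized within its group).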

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization Use Cases

  • Train a reasoning model that solves math problems with clear steps and fewer random jumps.
  • Build a system that answers complex questions with longer but stable explanations.
  • Reproduce AIME 2024 scores and compare different training settings on real benchmarks.

Installation & Setup (Getting Started)

Follow these exact steps to set up the environment and run inference. Do not skip any command.

Environment Setup

We recommend using conda to set up the environment:

conda create -n dapo python=3.10
conda activate dapo
pip3 install -r requirements.txt

Inference

We provide the model inference code here:

import torch
from transformers import AutoTokenizer
from vllm import SamplingParams, LLM

# Three AIME 2024 problems with their ground-truth answers.
examples = [
    {
        "question": "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nFind the largest possible real part of \\[(75+117i)z+\\frac{96+144i}{z}\\]where $z$ is a complex number with $|z|=4$.\n\nRemember to put your answer on its own line after \"Answer:\".",
        "answer": "540"
    },
    {
        "question": "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nEvery morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\n\nRemember to put your answer on its own line after \"Answer:\".",
        "answer": "204"
    },
    {
        "question": "Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nLet $\\mathcal{B}$ be the set of rectangular boxes with surface area $54$ and volume $23$. Let $r$ be the radius of the smallest sphere that can contain each of the rectangular boxes that are elements of $\\mathcal{B}$. The value of $r^2$ can be written as $\\frac{p}{q}$, where $p$ and $q$ are relatively prime positive integers. Find $p+q$.\n\nRemember to put your answer on its own line after \"Answer:\".",
        "answer": "721"
    }
]


def main():
    model = "BytedTsinghua-SIA/DAPO-Qwen-32B"

    # Load the tokenizer so the model's chat template can be applied.
    tokenizer = AutoTokenizer.from_pretrained(model)

    llm = LLM(
        model=model,
        dtype=torch.bfloat16,
        tensor_parallel_size=8,
        gpu_memory_utilization=0.95
    )

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.7,
        max_tokens=20480
    )

    for example in examples:
        question = example["question"]
        answer = example["answer"]
        # Render the chat template to a plain string, then generate.
        output = llm.generate(
            prompts=tokenizer.apply_chat_template(
                conversation=[{"content": question, "role": "user"}],
                add_generation_prompt=True,
                tokenize=False
            ),
            sampling_params=sampling_params
        )
        print(f"***QUESTION***:\n{question}\n***GROUND TRUTH***:\n{answer}\n***MODEL OUTPUT***:\n{output[0].outputs[0].text}\n")
        print("-" * 100)


if __name__ == "__main__":
    main()

Evaluation on AIME 2024

To evaluate the model on AIME 2024, we deploy DAPO-Qwen-32B with Ray Serve and vLLM.

To load the model from Hugging Face:

serve run eval.llm:build_app model=BytedTsinghua-SIA/DAPO-Qwen-32B tensor-parallel-size=8

# open another terminal
python eval/eval_aime24.py --temperature 1.0 --top_p 0.7 --max_tokens 20480 --model BytedTsinghua-SIA/DAPO-Qwen-32B --test_file eval/aime-2024.parquet

To load the model from a local path:

serve run eval.llm:build_app model=aaa/bbb/ccc tensor-parallel-size=8

# open another terminal
python eval/eval_aime24.py --temperature 1.0 --top_p 0.7 --max_tokens 20480 --model ccc --test_file eval/aime-2024.parquet

How DAPO Works (Plain-English Walkthrough)

DAPO teaches a model by trying an answer, scoring it, and then improving the next try. But it adds a few smart rules so training does not wobble or stall.

First, it widens the clip in the learning rule (Clip-Higher). This keeps the model from becoming too certain too fast.

Second, it picks better training samples (Dynamic Sampling). It drops prompts that are always right or always wrong in a batch, so each step has useful signals.
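As a rough illustration, the filter can be as simple as the sketch below (a hypothetical helper; the repo's actual implementation lives in the verl-based training code):

# Hypothetical sketch of Dynamic Sampling. Each prompt gets G sampled
# answers scored 0/1; groups that are all-correct or all-wrong carry zero
# advantage after group normalization, so they are dropped from the batch.
def dynamic_sampling_filter(reward_groups):
    kept = []
    for rewards in reward_groups:
        num_correct = sum(rewards)
        if 0 < num_correct < len(rewards):  # mixed group: useful gradient
            kept.append(rewards)
    return kept

# The all-correct and all-wrong groups are filtered out:
print(dynamic_sampling_filter([[0, 1, 1], [1, 1, 1], [0, 0, 0]]))
# -> [[0, 1, 1]]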

Third, it learns at the token level. This helps on long answers where each step matters.
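To see why the aggregation matters, compare sample-level and token-level averaging on two answers of very different lengths (a toy sketch, assuming per-token losses have already been computed):

import torch

# Toy per-token losses for a 12-token answer and a 300-token answer.
per_token_loss = [torch.randn(12), torch.randn(300)]

# Sample-level (GRPO-style): average within each answer, then across answers.
# A token in the long answer ends up weighted 25x less than a short-answer token.
sample_level = torch.stack([t.mean() for t in per_token_loss]).mean()

# Token-level (DAPO): pool every token, so each token counts equally.
# This preserves the learning signal for steps deep inside long chains of thought.
token_level = torch.cat(per_token_loss).mean()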

Last, it shapes rewards for overlong outputs. This keeps very long answers from adding noise.
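A toy version of that shaping is sketched below. The paper describes it as a soft punishment for overlong outputs; the budget numbers here are illustrative assumptions, not the paper's exact values:

# Illustrative soft length penalty: zero inside the budget, a linear ramp
# through a buffer zone, and a flat penalty once the hard cap is exceeded.
def overlong_penalty(num_tokens, max_len=20480, buffer=4096):
    if num_tokens <= max_len - buffer:
        return 0.0                                         # normal-length answer
    if num_tokens <= max_len:
        return ((max_len - buffer) - num_tokens) / buffer  # ramps from 0 to -1
    return -1.0                                            # truncated answer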

The Technology Behind It (In Simple Terms)

  • Importance ratio clip made wider: This reduces the risk of the model getting stuck with low diversity. The team saw early entropy collapse and fixed it with a larger upper clip.

  • Batch balance with Dynamic Sampling: It filters out dead-easy and dead-hard groups to keep gradients helpful. That saves steps and keeps updates stable.

  • Token-level loss for long reasoning: Long chains need token-by-token signals. This gives the model finer feedback at each step.

  • Reward shaping for very long texts: Long answers can add noise. Shaping their rewards keeps the training signal meaningful and steady.

Datasets You Can Use

Training data: DAPO-Math-17k, a cleaned math set made for this project. Validation set: AIME 2024, a well-known math contest benchmark.
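If the training set is published on the Hugging Face Hub under the team's organization, loading it is one call. The dataset id below is an assumption; check the repo's README for the authoritative link:

from datasets import load_dataset

# Assumed dataset id; verify against the official repo before use.
ds = load_dataset("BytedTsinghua-SIA/DAPO-Math-17k", split="train")
print(len(ds))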

You can follow the repo’s training scripts to reproduce scores. The team also shares wandb training records for both the early version (44 on AIME 2024) and the full version (50 on AIME 2024).

Performance & Showcases

DAPO reaches 50 points on AIME 2024 with DAPO-Qwen-32B. This result uses the full recipe with Token-level PG Loss and Dynamic Sampling.

The team also reports stable length growth, steady reward trends, and healthy entropy changes over time. These are all signs of consistent training.

Figure: AIME 2024 performance.

Step-by-Step: Run the Released Model

  1. Prepare environment with conda and pip.
  2. Use the inference script shown above to test math problems.
  3. For full AIME 2024 evaluation, start Ray Serve, then run the eval script in another terminal.

If you prefer local weights, use the “local path” commands provided. Keep the same sampling settings for a fair test.

Who Is DAPO For?

  • Teams building math or logic assistants that need stable training at scale.
  • Researchers who want a clear, reproducible RL recipe with real logs and checkpoints.
  • Engineers who want a working baseline to compare their own changes or datasets.

FAQs

What results can I expect out of the box?

With the provided weights and settings, the full DAPO recipe reports 50 on AIME 2024 using Qwen2.5-32B. An earlier recipe that omits Token-level PG Loss and Dynamic Sampling reports 44 on the same benchmark.

Do I need special hardware?

The sample configs show tensor_parallel_size=8 and bfloat16, which suggest multi-GPU use. For smaller hardware, you may adapt settings, but speed and max tokens will change.
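As a rough starting point for a smaller node, you might try settings like the sketch below. These values are illustrative assumptions, not tested configurations from the DAPO repo, and a 32B model in bfloat16 still needs on the order of 64 GB for the weights alone:

import torch
from vllm import LLM

# Illustrative settings for a smaller multi-GPU node; adjust to your hardware.
llm = LLM(
    model="BytedTsinghua-SIA/DAPO-Qwen-32B",
    dtype=torch.bfloat16,
    tensor_parallel_size=4,       # fewer GPUs than the reference setup
    gpu_memory_utilization=0.90,
    max_model_len=8192,           # shorter context to reduce memory pressure
)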

Can I train from scratch?

The repo shares training scripts, datasets, and logs to help you reproduce the setup. The team notes they will also share a full reproduction guideline for the Volcano Engine platform.


Image source: DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization