How RotorQuant Enhances KV Cache Compression in LLMs

What is How RotorQuant Enhances KV Cache Compression in LLMs
RotorQuant is a method that makes large language models run with a much smaller memory footprint by shrinking the key value cache. It does this with smart rotations that keep the model quality high while cutting memory and speeding up both prompt loading and token generation.
How RotorQuant Enhances KV Cache Compression in LLMs Overview
RotorQuant offers drop in KV cache quantization for popular LLM runtimes. It uses simple two dimensional and four dimensional rotations to decorrelate vectors before quantization, then reverses that step during reads. Tests show better quality per bit and higher speed than TurboQuant at the same compression level.
Project Overview
| Item | Details |
|---|---|
| Type | KV cache compression for LLMs |
| Purpose | Reduce memory use and boost speed while keeping quality high |
| Main features | Ten point three times compression at three bit symmetric, better PPL than TurboQuant, faster decode and prefill, works in llama dot cpp, CUDA and Metal paths |
| Quality | WikiText two PPL as low as six point nine one at three bit symmetric K and V on Llama three point one eight B Instruct |
| Speed | Decode up to twenty eight percent faster and prefill about five times faster than TurboQuant at the same compression |
| Memory savings | Save up to four point one six GB on one hundred twenty eight K context with three bit symmetric |
| Cache types | planar3, iso3, planar4, iso4 plus turbo3 and turbo4 in the reference fork |
| Who is it for | Engineers, researchers, and teams shipping long context chat or high volume inference |
| Integration | Drop in flags for llama dot cpp, Python Triton reference for research |
| Learn more | For a short primer on repositories and workflows, see our guide on GitHub basics |
How RotorQuant Enhances KV Cache Compression in LLMs Key Features
- High compression with strong quality. Three bit symmetric K and V gives about ten point three times compression with PPL six point nine one on Llama three point one eight B Instruct.
- Faster inference. Decode runs about twenty eight percent faster and prefill about five times faster than TurboQuant at the same compression.
- Fewer parameters. The rotation math uses only one hundred twenty eight parameters instead of sixteen thousand three hundred eighty four.
- Drop in setup. Works as flags in llama dot cpp. No model retrain needed.
- Flexible cache types. Use planar3 or iso3 for three bit, planar4 or iso4 for four bit. Mix K and V as you need.
- Deferred quantization. Keep K as FP16 during prefill to avoid early error and then quantize on insert during decode.
How RotorQuant Enhances KV Cache Compression in LLMs Use Cases
- Long context chat that must stay fast and cheap. Memory drops by about ten times, so you can serve more users per GPU.
- Edge or small rigs with tight VRAM. Three bit symmetric can fit thirty two K or sixty five K context windows in cards that could not hold them before.
- Research and eval. Try K only compression for five times smaller K cache with zero PPL loss, then move to symmetric K and V for max savings.
- Batch serving. Higher prefill speed lets you push more tokens per second for the first response.
For teams exploring model serving at scale, see our short note on large AI platforms here: Bytedance insights.
Installation and Setup
Below are the exact steps from the project to build and run. Follow them in order.
llama dot cpp path recommended for speed
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache
# CUDA
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Symmetric 3-bit (best quality per bit)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k iso3 --cache-type-v iso3 --host 0.0.0.0 --port 8080
# K-only (zero PPL loss, 5x compression)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k planar3 --cache-type-v f16 --host 0.0.0.0 --port 8080
# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -ctk planar3 -ctv planar3 -p 512 -n 128
# Perplexity
pip install datasets
python3 -c "from datasets import load_dataset; open('/tmp/wiki.txt','w').write('\n'.join(load_dataset('wikitext','wikitext-2-raw-v1',split='test')['text']))"
./build/bin/llama-perplexity -m model.gguf -f /tmp/wiki.txt -ngl 99 -c 2048 \
--cache-type-k iso3 --cache-type-v iso3
Cache types: planar3, iso3, planar4, iso4 (ours) + turbo3, turbo4 (TheTom's WHT)
Python Triton research path
pip install -e . && pip install triton
from turboquant import IsoQuantMSE, PlanarQuantMSE
# IsoQuant: best 4-bit quality (PPL 9.03)
iq = IsoQuantMSE(d=128, bits=4, mode='fast', device='cuda')
x_hat, indices = iq(x)
# PlanarQuant: best 3-bit quality (PPL 10.12)
pq = PlanarQuantMSE(d=128, bits=3, device='cuda')
x_hat, indices = pq(x)
Extra benchmark scripts
python -m turboquant.benchmark_google_parity # PPL (post-prefill)
python -m turboquant.benchmark_perplexity --bits 3 4 # PPL (roundtrip)
python -m turboquant.benchmark_triton # Triton kernel speed
python -m turboquant.poc_high_context --backend planar # High-context generation
Tip: If you are new to managing repos, we have a short friendly intro here: getting started with GitHub.
How it Works
RotorQuant first normalizes each KV vector and stores the size separately. This lets the rest of the math focus on direction.
Next it rotates the vector in small blocks. The rotation breaks up correlations between coordinates while staying cheap to compute.
Then it quantizes each coordinate with learned centroids. When reading back, it applies the matching inverse rotation and restores the stored size.
The Technology Behind It
The method uses small block rotations instead of a big global transform. Two D Givens rotations are called PlanarQuant. Four D quaternion rotations are called IsoQuant.
These small blocks are fast, easy to run in parallel, and need very few parameters. Tests show they keep the direction of vectors in a way that helps attention.
Deferred quantization also matters. Keep the K cache as FP16 during prompt prefill, then quantize on insert during decode to avoid compounding error.
Performance and Showcases
Headline results with Llama three point one eight B Instruct on an RTX 5090 show very strong numbers at three bit symmetric K and V.
- PPL iso3 six point nine one vs turbo3 seven point zero seven. Better quality per bit.
- Decode one hundred nineteen tokens per second vs ninety three. About twenty eight percent faster.
- Prefill three thousand eight hundred twenty two tokens per second vs seven hundred twenty two. About five times faster.
VRAM savings at three bit symmetric and ten point three times compression:
- Eight K context: from two hundred eighty eight MB to twenty eight MB. Save two hundred sixty MB.
- Thirty two K context: from one thousand one hundred fifty two MB to one hundred twelve MB. Save about one point zero four GB.
- One hundred twenty eight K context: from four thousand six hundred eight MB to four hundred forty seven MB. Save about four point one six GB.
Qwen two point five three B K only decode speed also improves:
- RTX 5090 planar3 K gives three hundred sixty seven tokens per second decode and twenty three thousand six hundred tokens per second prefill at PPL nine point nine eight.
- M4 Mac Mini shows gains too with planar3 K.
Needle in Haystack checks pass at eight K, thirty two K, and sixty five K contexts.
For more plain language guides and tools, visit our home page: Omnihuman 1.
Tips and Best Practices
- For best quality per bit at three bit, set both K and V to iso3. For a simple speed boost with no quality drop, try K only planar3 with V as FP16.
- Use CUDA builds on Nvidia GPUs and Metal builds on Apple Silicon. Match the flags shown above.
- For benchmarks, keep the GPU layers high. The examples use ngl ninety nine to keep attention on device.
FAQ
What is the KV cache and why compress it
The KV cache stores past keys and values so the model can attend to them at each new token. It grows with the context length and can fill VRAM fast. Compressing it saves memory and can also speed up reads.
Do I need to fine tune my model to use RotorQuant
No. This is a drop in change in your runtime flags. You do not need to retrain or edit the model weights.
What should I pick between PlanarQuant and IsoQuant
PlanarQuant is the two D option and IsoQuant is the four D option. At three bit, PlanarQuant and IsoQuant both do well, with iso3 showing the best PPL in the headline table. At four bit, IsoQuant is a strong pick.
How do I get zero PPL loss
Use K only compression. Keep V as FP16 as shown in the commands. This gives about five times smaller K cache with no loss in perplexity.
Will this help on Apple Silicon
Yes. There is a Metal build path with fused kernels in the repo. You can use the same cache type flags.
Can I mix cache types for K and V
Yes. The flags let you set K and V independently. Some mixes like planar3 for K and FP16 for V can be fast and stable.
Where can I see the code and updates
The project home has all commands and notes. You can also read our quick primer on repos and issues here: GitHub primer.
Image source: How RotorQuant Enhances KV Cache Compression in LLMs