MOC: Mixture of Cluster-conditional LoRAs for Vision-Language Control

What is MOC: Mixture of Cluster-conditional LoRAs for Vision-Language Control
MOC here refers to a research project for long video generation, presented on its official page as “Mixture of Contexts for Long Video Generation.” The core idea is simple: teach a model to remember just the right parts of past frames so it can keep a story going over a long time, while still running fast.

This is done with a smart “router” that picks only the most useful chunks of past content. The team reports strong results on both long multi-shot videos and short clips.
MOC: Mixture of Cluster-conditional LoRAs for Vision-Language Control Overview
Here is a quick look at the project.
| Item | Details |
|---|---|
| Project Name | Mixture of Contexts (MoC) for Long Video Generation |
| Type | Research method for long video creation |
| Purpose | Keep story and details in minute-long videos while staying efficient |
| Key Idea | Learnable Sparse Attention Routing selects only useful history chunks |
| Long Video Support | Multi-shot, minute-long videos (~180k tokens) |
| Short Video Support | 8-second clips at 320×192 (~6.5k tokens) |
| Efficiency | Prunes ≈85% token pairs on minute-long videos; ≈83% on short clips |
| Compute Savings | About 7× fewer FLOPs for long videos |
| Quality | Maintains or improves video quality on short clips |
| Training Data | Large-scale long video data |
| Team | Stanford University, ByteDance Seed, Johns Hopkins University, CUHK, ByteDance |
| Paper | Research Paper (linked on the project page) |
| Project URL | https://primecai.github.io/moc/ |
| Status | Research project; demos available on the project page |
To learn about the company behind some authors, see our short profile on ByteDance.
MOC: Mixture of Cluster-conditional LoRAs for Vision-Language Control Key Features
- Learnable Sparse Attention Routing: A router learns to pick the most helpful past chunks.
- Long Context at Lower Cost: Cuts about 85% of token pairs and reduces FLOPs roughly 7× on minute-long runs.
- Multi-shot Support: Keeps story and character across several shots in one video.
- Works on Short Clips Too: On 8-second clips, it prunes about 83% of token pairs while keeping, or even improving, quality.
MOC: Mixture of Cluster-conditional LoRAs for Vision-Language Control Use Cases
- Story-first video creation: Keep a clear thread across scenes for ads, teasers, and short films.
- Pre-visualization: Draft long scenes while controlling cost.
- Education and training: Build long explainer videos with steady topics.
- Sports or events: Create longer summaries without losing key context.
- Social content: Produce longer clips with fewer quality drops.
How It Works
Think of a video as many small pieces. The model does not need all of them all the time. The router learns which pieces matter most, then focuses on those.
This makes the model faster and keeps memory clear over long clips. It helps the model keep track of characters, motion, and scene changes across shots.
The router is trained end-to-end from long video data. Over time, it learns which history chunks help most for the next frames.
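The routing step described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's code: the function name `route_chunks`, the mean-pooled chunk descriptors, and the plain dot-product scoring are all assumptions standing in for whatever learned scoring the actual router uses.

```python
# Hypothetical sketch of chunk-based sparse routing: split the history into
# chunks, score each chunk against the current query, keep only the top-k.
# (Illustrative only; the paper's learned router is more sophisticated.)
import numpy as np

def route_chunks(query, history, chunk_size=4, k=2):
    """Pick the k history chunks whose mean feature best matches the query."""
    n_chunks = len(history) // chunk_size
    chunks = history[:n_chunks * chunk_size].reshape(n_chunks, chunk_size, -1)
    descriptors = chunks.mean(axis=1)     # one summary vector per chunk
    scores = descriptors @ query          # similarity of each chunk to the query
    top = np.argsort(scores)[-k:]         # indices of the k best-scoring chunks
    return np.sort(top)

rng = np.random.default_rng(0)
history = rng.standard_normal((16, 8))    # 16 past tokens, feature dim 8
query = rng.standard_normal(8)            # current query token
print(route_chunks(query, history))       # indices of the chunks kept
```

Attention is then computed only against the tokens inside the selected chunks, which is where the pruning of unhelpful token pairs comes from.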
The Technology Behind It
The routing is “sparse,” meaning it prunes a large share of unhelpful token pairs; that pruning is what speeds up long runs so much. It is also “learnable,” so it improves during training rather than being hand-coded. The team notes the routing itself is “non-parametric,” which keeps it light.
In tests, it shows strong gains on both long and short video settings. For long videos, it keeps minute-long context with big compute savings.
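A quick back-of-envelope check, assuming attention cost scales linearly with the number of token pairs (a simplification; the paper's own accounting may differ), shows how the reported pruning rates line up with the ~7× compute figure:

```python
# If attention cost scales with the number of token pairs, pruning ~85% of
# pairs leaves ~15%, i.e. roughly a 1/0.15 ≈ 6.7x reduction — consistent
# with the ~7x FLOP savings reported for minute-long videos.
def attention_speedup(prune_fraction):
    kept = 1.0 - prune_fraction
    return 1.0 / kept

print(round(attention_speedup(0.85), 1))  # long videos:  ≈ 6.7x
print(round(attention_speedup(0.83), 1))  # short clips:  ≈ 5.9x
```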
Performance & Showcases
Showcases 1–6 — Learnable Sparse Attention Routing (video demos available on the project page).
Getting Started
- Visit the project page: https://primecai.github.io/moc/. You can watch demos and open the paper from there.
- Explore the demos to see how the router keeps context over long runs.
- Read the paper for details on training data and method notes.
If code, models, or setup steps are released later, they will likely appear on the same page. Keep an eye on updates there.
FAQs
What problem does MOC solve?
Long videos need memory to keep the story clear. MOC teaches a model to focus on the most useful past parts, so it can run longer with less work.
How fast is it on long videos?
On minute-long, multi-shot videos (~180k tokens), it prunes about 85% of token pairs. This cuts FLOPs by around 7× so you can run longer without huge cost.
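To get a feel for the scale, assume full self-attention would score every token pair (a simplification used only for this estimate):

```python
# Rough scale of the savings on a minute-long run (~180k tokens), assuming
# full self-attention would otherwise score every token pair.
tokens = 180_000
all_pairs = tokens * tokens              # 32.4 billion candidate pairs
kept_pairs = round(all_pairs * 0.15)     # ~15% survive the router's pruning
print(f"{all_pairs:.2e} -> {kept_pairs:.2e}")  # 3.24e+10 -> 4.86e+09
```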
Does it help short clips too?
Yes. On 8-second clips (~6.5k tokens), it prunes around 83% of token pairs and keeps, or even improves, the final quality.
Image source: Mixture of Contexts (MoC) project page.