MoC: Mixture of Contexts for Long Video Generation

What is MoC: Mixture of Contexts for Long Video Generation

MoC here refers to a research project for long video creation from the official page titled “Mixture of Contexts for Long Video Generation.” The core idea is simple: teach a model to remember just the right parts of past frames so it can keep a story going over a long time, while still running fast.

This is done with a smart “router” that picks only the most useful chunks of past content. The team reports strong results on both long multi-shot videos and short clips.

MoC: Mixture of Contexts for Long Video Generation Overview

Here is a quick look at the project.

  • Project Name: Mixture of Contexts (MoC) for Long Video Generation
  • Type: Research method for long video creation
  • Purpose: Keep story and details in minute-long videos while staying efficient
  • Key Idea: Learnable sparse attention routing selects only useful history chunks
  • Long Video Support: Multi-shot, minute-long videos (~180k tokens)
  • Short Video Support: 8-second clips at 320×192 (~6.5k tokens)
  • Efficiency: Prunes ≈85% of token pairs on minute-long videos; ≈83% on short clips
  • Compute Savings: About 7× fewer FLOPs for long videos
  • Quality: Maintains or improves video quality on short clips
  • Training Data: Large-scale long video data
  • Team: Stanford University, ByteDance Seed, Johns Hopkins University, CUHK, ByteDance
  • Paper: Research paper (linked on the project page)
  • Project URL: https://primecai.github.io/moc/
  • Status: Research project; demos available on the project page

To learn about the company behind some authors, see our short profile on ByteDance.

MoC: Mixture of Contexts for Long Video Generation Key Features

  • Learnable Sparse Attention Routing: A router learns to pick the most helpful past chunks.
  • Long Context at Lower Cost: Cuts about 85% of token pairs and reduces FLOPs roughly 7× on minute-long runs.
  • Multi-shot Support: Keeps story and character across several shots in one video.
  • Works on Short Clips Too: On 8-second clips, it prunes about 83% of token pairs while keeping, or even improving, quality.

MoC: Mixture of Contexts for Long Video Generation Use Cases

  • Story-first video creation: Keep a clear thread across scenes for ads, teasers, and short films.
  • Pre-visualization: Draft long scenes while controlling cost.
  • Education and training: Build long explainer videos with steady topics.
  • Sports or events: Create longer summaries without losing key context.
  • Social content: Produce longer clips with fewer quality drops.

How It Works

Think of a video as many small pieces. The model does not need all of them all the time. The router learns which pieces matter most, then focuses on those.

This makes the model faster and keeps memory clear over long clips. It helps the model keep track of characters, motion, and scene changes across shots.

The router is trained end-to-end from long video data. Over time, it learns which history chunks help most for the next frames.
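
To make the routing idea concrete, below is a minimal sketch of chunk-level sparse attention in PyTorch. It is not the authors' implementation: the chunk size, the mean-key scoring rule, and the top-k value are illustrative assumptions. The point is simply to show how attention can be limited to a few selected history chunks instead of the whole past.

```python
# Minimal sketch of chunk-level sparse attention routing (illustrative only).
import torch
import torch.nn.functional as F

def routed_attention(query, history, chunk_size=64, top_k=4):
    """query: (d,) current query vector; history: (T, d) past key/value tokens."""
    T, d = history.shape
    # Group history tokens into fixed-size chunks (drop any ragged tail for simplicity).
    chunks = history[: T - T % chunk_size].reshape(-1, chunk_size, d)   # (C, chunk_size, d)

    # Score each chunk by how well its mean key matches the query (a simple
    # stand-in for a learned router), then keep only the top_k highest-scoring chunks.
    chunk_scores = chunks.mean(dim=1) @ query                           # (C,)
    keep = chunk_scores.topk(min(top_k, chunks.shape[0])).indices       # (top_k,)

    # Run ordinary attention, but only over tokens from the selected chunks,
    # so most token pairs are never computed.
    selected = chunks[keep].reshape(-1, d)                              # (top_k * chunk_size, d)
    attn = F.softmax(selected @ query / d ** 0.5, dim=0)
    return attn @ selected                                              # (d,) attended output

# Example: 4,096 history tokens, but attention only touches 4 chunks of 64 tokens.
out = routed_attention(torch.randn(128), torch.randn(4096, 128))
```

In the real method, this selection is learned end-to-end with the video model, so the chunks it picks improve as training goes on.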

The Technology Behind It

The routing is “sparse,” meaning it prunes a lot of unhelpful pairs. This is why it speeds up long runs so much.

It is “learnable,” so it improves during training rather than relying on hand-coded rules. The team also notes the routing is “non-parametric,” which keeps it lightweight.

In tests, it shows strong gains on both long and short video settings. For long videos, it keeps minute-long context with big compute savings.
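
The reported savings follow from simple arithmetic if you assume that attention over token pairs dominates the compute. The snippet below is just a sanity check of that relationship, not a measurement from the paper.

```python
# Rough sanity check (assumes attention over token pairs dominates compute).
kept = 1 - 0.85                 # ~85% of token pairs pruned on minute-long videos
print(f"kept pairs: {kept:.0%}, implied speedup: ~{1 / kept:.1f}x")
# -> kept pairs: 15%, implied speedup: ~6.7x, close to the reported ~7x fewer FLOPs
```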

Performance & Showcases

Showcases 1–6: Learnable Sparse Attention Routing (video demos embedded on the project page).

Getting Started

  • Visit the project page: https://primecai.github.io/moc/. You can watch demos and open the paper from there.
  • Explore the demos to see how the router keeps context over long runs.
  • Read the paper for details on training data and method notes.

If code, models, or setup steps are released later, they will likely appear on the same page. Keep an eye on updates there.

FAQs

What problem does MoC solve?

Long videos need memory to keep the story clear. MoC teaches a model to focus on the most useful past parts, so it can run longer with less work.

How fast is it on long videos?

On minute-long, multi-shot videos (~180k tokens), it prunes about 85% of token pairs. This cuts FLOPs by around 7× so you can run longer without huge cost.

Does it help short clips too?

Yes. On 8-second clips (~6.5k tokens), it prunes around 83% of token pairs and keeps, or even improves, the final quality.

Image source: Mixture of Contexts (MoC) for Long Video Generation project page