From Dance to MIDI: Multi-Instrument Music Generation with Movement

What Is From Dance to MIDI: Multi-Instrument Music Generation with Movement?
From Dance to MIDI (Dance2MIDI) is a research project that turns a solo dance video into multi-instrument MIDI music. It reads the body movement in the video and then writes drum beats first, followed by melody and harmony parts across many instruments.

This project is built by a team from Zhejiang University, National University of Singapore, ByteDance AI Lab, and Beijing Film Academy. It comes with a large paired dataset and a working demo with sample videos and audio.
From Dance to MIDI: Multi-Instrument Music Generation with Movement Overview
Here is a quick view of what the project offers and how it is set up today.
| Item | Details |
|---|---|
| Type | Research project with demo site |
| Purpose | Create multi-track MIDI music from a single-person dance video |
| Inputs | Solo dance video clip (30 seconds) |
| Outputs | Multi-instrument MIDI (drums plus up to 12 melodic/harmony instruments) |
| Main Features | New paired dataset (D2MIDI), drum-first then multi-track generation, movement + style features, polyphonic output |
| Supported Styles | Classical, hip-hop, ballet, modern, Latin, house, pop |
| Dataset Size | 71,754 video–MIDI pairs, each 30 seconds |
| Instruments in MIDI | Up to 13 types (e.g., Piano, Guitar, Violin, Strings, Brass, Sax, Piccolo, Synth, Pads, Drums) |
| Status | Demos online; dataset and full code planned for release |
| Code Availability | “Implementation — Coming soon!” on GitHub |
| Maintainers/Authors | Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han |
| Project URL | https://dance2midi.github.io/ |
From Dance to MIDI: Multi-Instrument Music Generation with Movement Key Features
- Multi-instrument MIDI, not just one melody line. It writes drums first, then adds other tracks to build a full song.
- Reads both dance style and movement. It studies joint points from the video to understand energy, steps, and flair.
- Paired dataset at scale. 71,754 aligned pairs help the model learn strong links between moves and music.
- Polyphonic output. Notes can overlap and stack, just like real songs.
- Works across many dance styles: pop, classical, hip-hop, ballet, modern, Latin, and house.
From Dance to MIDI: Multi-Instrument Music Generation with Movement Use Cases
- Choreographers can turn practice clips into music drafts for quick feedback.
- Educators can show how movement shapes rhythm and melody in class.
- Creators can spark new song ideas from movement alone.
- Filmmakers and short-form video makers can test dance-matched music options fast.
Performance & Showcases
Below are short descriptions for each demo. The label shows the dance style tag used by the team.
Showcase 1 — Pop Dance synced MIDI music built from movement cues
Showcase 2 — Pop Dance with clear drum groove and layered instruments
Showcase 3 — Pop Dance where footwork lines up with drum hits
Showcase 4 — Pop Dance showing melody rising over a steady beat
Showcase 5 — Pop Dance with fills and accents that match big moves
Showcase 6 — Pop Dance blending rhythm and harmony from the same video
How From Dance to MIDI Works
Dance2MIDI starts by detecting human joints in the dance video. These keypoints help the system read posture, speed, and style. From them it builds two kinds of features: motion features that capture how the dancer moves, and style features that capture the character of the dance.
Next, the system writes the drum track. Drums set the pulse and groove. A decoder model predicts the drum notes step by step.
Then the system fills in other instruments. A second model studies the full music context and completes the missing parts. This builds a rich, multi-track MIDI song.
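To make the flow concrete, here is a minimal Python sketch of the three stages described above, with toy stand-ins for the real models. Every name here (extract_joints, decode_drums, infill_instruments) is a hypothetical placeholder, not the authors' released API.

```python
# A hedged sketch of the three-stage pipeline described above. All function
# names are illustrative placeholders, not the authors' actual API.

def extract_joints(video_frames):
    """Stand-in for a pose estimator: one (x, y) point per joint per frame."""
    return [[(0.0, 0.0)] * 17 for _ in video_frames]  # 17-joint skeleton assumed

def decode_drums(motion_features, n_steps=64):
    """Drum-first decoding: predict one drum event at a time, in order."""
    events = []
    for step in range(n_steps):
        # A real decoder would condition on motion_features and on the
        # events generated so far; this toy rule just marks a 4/4 pulse.
        events.append("kick" if step % 4 == 0 else "hihat")
    return events

def infill_instruments(drum_events, motion_features):
    """Stand-in context model: complete the non-drum tracks around the groove."""
    return {"drums": drum_events, "piano": [], "strings": []}

frames = [None] * 600                      # 30 s clip at 20 fps, per the dataset
joints = extract_joints(frames)            # stage 1: read posture and motion
tracks = infill_instruments(decode_drums(joints), joints)  # stages 2 and 3
print(sorted(tracks))                      # ['drums', 'piano', 'strings']
```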

The Technology Behind It (Plain English)
- Motion reading with graphs: The body has joints like shoulders, elbows, knees, and ankles. A graph model captures how these points move together, letting the system infer both motion and style (see the graph sketch after this list).
- Drum-first decoding: A Transformer-based decoder writes drum events in order. Drums lock in the rhythm that other tracks can follow.
- Complete the rest with a context model: A BERT-like model looks at the whole piece and fills in the remaining notes for instruments like piano, strings, guitar, and brass (see the infilling sketch after this list).
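Here is a minimal sketch of the graph idea from the first bullet, assuming a toy five-joint skeleton. The joints, edges, and weights are illustrative, not the model's real configuration.

```python
import numpy as np

# A toy skeleton graph and one graph-convolution-style aggregation step.
JOINTS = ["shoulder", "elbow", "wrist", "knee", "ankle"]
EDGES = [(0, 1), (1, 2), (3, 4)]  # bones linking neighboring joints

# Adjacency matrix with self-loops, row-normalized.
A = np.eye(len(JOINTS))
for i, j in EDGES:
    A[i, j] = A[j, i] = 1.0
A /= A.sum(axis=1, keepdims=True)

x = np.random.rand(len(JOINTS), 2)  # per-joint (x, y) position for one frame
W = np.random.rand(2, 8)            # a learnable projection in a real model
features = A @ x @ W                # each joint mixes in its neighbors' motion
print(features.shape)               # (5, 8): one feature vector per joint
```

And a small sketch of the infilling idea from the last bullet, assuming note-event tokens and a placeholder "model"; the token names and vocabulary are invented for illustration.

```python
import random

# BERT-style infilling: keep the drum tokens, mask the rest, and let a
# context model fill the gaps. Everything here is a toy placeholder.
tokens = ["drum_kick", "piano_C4", "strings_G3", "piano_E4", "drum_snare"]
masked = [t if t.startswith("drum") else "[MASK]" for t in tokens]

def fill_masks(seq):
    """Stand-in for the context model: predict a note for each masked slot."""
    vocab = ["piano_C4", "piano_E4", "strings_G3", "guitar_A3"]
    return [random.choice(vocab) if t == "[MASK]" else t for t in seq]

print(masked)              # drum events are kept as the fixed rhythmic anchor
print(fill_masks(masked))  # the other tracks are completed in context
```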
The D2MIDI Dataset
D2MIDI is the first large, paired dance-to-MIDI dataset for many instruments. It includes only solo dancers and filters out low-quality clips. Every video is aligned with multi-instrument, polyphonic MIDI.
Clips are 30 seconds long (600 frames, or 20 frames per second). The team samples clips using a sliding window of 40 frames to reach 71,754 pairs. Styles include classical, hip-hop, ballet, modern, Latin, and house.
Each pair can include up to 13 instrument types. These include Acoustic Grand Piano, Celesta, Drawbar Organ, Acoustic Guitar (nylon), Acoustic Bass, Violin, String Ensemble 1, SynthBrass 1, Soprano Sax, Piccolo, Lead 1 (square), Pad 1 (new age), and Drum.
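As a rough illustration of that sliding-window sampling, here is a short Python sketch using the frame counts from the text; the helper name and the example video length are assumptions.

```python
# Sliding-window sampling: 600-frame (30 s) clips cut from longer videos
# with a 40-frame stride, matching the numbers in the text.
CLIP_LEN = 600   # frames per clip (30 seconds at 20 fps)
STRIDE = 40      # sliding-window step between clip starts

def sample_clips(total_frames):
    """Return (start, end) frame ranges for every full-length clip."""
    return [(s, s + CLIP_LEN)
            for s in range(0, total_frames - CLIP_LEN + 1, STRIDE)]

print(len(sample_clips(2000)))  # a 100 s video yields 36 overlapping clips
```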
Installation & Setup
Right now, the public GitHub repository lists the following project status:
- Release the D2MIDI dataset.
- Release the demo video.
- Release the main code for implementation.
- Implementation: Coming soon!
There are no install commands in the repository yet. The team plans to share the dataset and main code; check the project site and GitHub for updates.
Try It Today: Quick Steps
- Watch the demo videos on the project page to hear results in action. Compare the original audio with the generated versions.
- Note how kicks, snares, and fills match strong body moves. Listen for how piano, strings, and other parts sit on top of the beat.
- Keep an eye on the GitHub “Implementation” section for the code release and dataset drop.
FAQ
What does the system need as input?
It needs a solo dance video clip. The model reads body joints and timing to guide the music.
What format is the output?
It outputs MIDI with many instruments. You can load the MIDI in a DAW to change sounds or edit notes.
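If you want to inspect the output programmatically before opening a DAW, a small example with the pretty_midi library might look like the sketch below; the filename is a placeholder, since the code and outputs are not yet released.

```python
import pretty_midi  # pip install pretty_midi

# Inspect a generated multi-track MIDI file: list each track's instrument
# and note count. "dance2midi_output.mid" is a hypothetical filename.
midi = pretty_midi.PrettyMIDI("dance2midi_output.mid")
for track in midi.instruments:
    name = ("Drums" if track.is_drum
            else pretty_midi.program_to_instrument_name(track.program))
    print(f"{name}: {len(track.notes)} notes")
```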
Can I use it for styles other than pop?
Yes, the dataset covers classical, hip-hop, ballet, modern, Latin, and house. The demos on the website also show variety.
Is the code available today?
Not yet. The GitHub says the main code and dataset will be released. The demo videos are already online.
How long is each training clip?
Each clip is 30 seconds. This length helps the model learn steady rhythm and song structure.
Who built this project?
Authors are from Zhejiang University, National University of Singapore, ByteDance AI Lab, and Beijing Film Academy. You can find names and links on the project page.