From Dance to MIDI: Multi-Instrument Music Generation with Movement

What Is From Dance to MIDI: Multi-Instrument Music Generation with Movement?
From Dance to MIDI (Dance2MIDI) is a research project that turns a solo dance video into multi-instrument MIDI music. It reads the body movement in the video and then writes drum beats first, followed by melody and harmony parts across many instruments.

This project is built by a team from Zhejiang University, National University of Singapore, ByteDance AI Lab, and Beijing Film Academy. It comes with a large paired dataset and a working demo with sample videos and audio.
From Dance to MIDI: Multi-Instrument Music Generation with Movement Overview
Here is a quick view of what the project offers and how it is set up today.
| Item | Details |
|---|---|
| Type | Research project with demo site |
| Purpose | Create multi-track MIDI music from a single-person dance video |
| Inputs | Solo dance video clip (30 seconds) |
| Outputs | Multi-instrument MIDI (drums plus up to 12 melodic/harmony instruments) |
| Main Features | New paired dataset (D2MIDI), drum-first then multi-track generation, movement + style features, polyphonic output |
| Supported Styles | Classical, hip-hop, ballet, modern, Latin, house, pop |
| Dataset Size | 71,754 video–MIDI pairs, each 30 seconds |
| Instruments in MIDI | Up to 13 types (e.g., Piano, Guitar, Violin, Strings, Brass, Sax, Piccolo, Synth, Pads, Drums) |
| Status | Demos online; dataset and full code planned for release |
| Code Availability | “Implementation — Coming soon!” on GitHub |
| Maintainers/Authors | Bo Han, Yuheng Li, Yixuan Shen, Yi Ren, Feilin Han |
| Project URL | https://dance2midi.github.io/ |
From Dance to MIDI: Multi-Instrument Music Generation with Movement Key Features
- Multi-instrument MIDI, not just one melody line. It writes drums first, then adds other tracks to build a full song.
- Reads both dance style and movement. It studies joint points from the video to understand energy, steps, and flair.
- Paired dataset at scale. 71,754 aligned pairs help the model learn strong links between moves and music.
- Polyphonic output. Notes can overlap and stack, just like real songs.
- Works across many dance styles: pop, classical, hip-hop, ballet, modern, Latin, and house.
From Dance to MIDI: Multi-Instrument Music Generation with Movement Use Cases
- Choreographers can turn practice clips into music drafts for quick feedback.
- Educators can show how movement shapes rhythm and melody in class.
- Creators can spark new song ideas from movement alone.
- Filmmakers and short-form video makers can test dance-matched music options fast.
Performance & Showcases
Below are short descriptions for each demo. The label shows the dance style tag used by the team.
Showcase 1 — Pop Dance synced MIDI music built from movement cues
Showcase 2 — Pop Dance with clear drum groove and layered instruments
Showcase 3 — Pop Dance where footwork lines up with drum hits
Showcase 4 — Pop Dance showing melody rising over a steady beat
Showcase 5 — Pop Dance with fills and accents that match big moves
Showcase 6 — Pop Dance blending rhythm and harmony from the same video
How From Dance to MIDI Works
Dance2MIDI starts by detecting human joints in the dance video. These keypoints help the system read posture, speed, and style. From them it builds two kinds of features: motion features that capture how the dancer moves, and style features that capture the character of the dance.
Next, the system writes the drum track. Drums set the pulse and groove. A decoder model predicts the drum notes step by step.
Then the system fills in other instruments. A second model studies the full music context and completes the missing parts. This builds a rich, multi-track MIDI song.
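To make the flow concrete, here is a minimal Python sketch of the three stages described above, with toy stand-ins for the real models. Every name here (extract_joints, decode_drums, infill_instruments) is a hypothetical placeholder, not the authors' released API.

```python
# A hedged sketch of the three-stage pipeline described above. All function
# names are illustrative placeholders, not the authors' actual API.

def extract_joints(video_frames):
    """Stand-in for a pose estimator: one (x, y) point per joint per frame."""
    return [[(0.0, 0.0)] * 17 for _ in video_frames]  # 17-joint skeleton assumed

def decode_drums(motion_features, n_steps=64):
    """Drum-first decoding: predict one drum event at a time, in order."""
    events = []
    for step in range(n_steps):
        # A real decoder would condition on motion_features and on the
        # events generated so far; this toy rule just marks a 4/4 pulse.
        events.append("kick" if step % 4 == 0 else "hihat")
    return events

def infill_instruments(drum_events, motion_features):
    """Stand-in context model: complete the non-drum tracks around the groove."""
    return {"drums": drum_events, "piano": [], "strings": []}

frames = [None] * 600                      # 30 s clip at 20 fps, per the dataset
joints = extract_joints(frames)            # stage 1: read posture and motion
tracks = infill_instruments(decode_drums(joints), joints)  # stages 2 and 3
print(sorted(tracks))                      # ['drums', 'piano', 'strings']
```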

The Technology Behind It (Plain English)
- Motion reading with graphs: The body has joints like shoulders, elbows, knees, and ankles. A graph model captures how these points move together, letting the system infer both motion and style (see the graph sketch after this list).
- Drum-first decoding: A Transformer-based decoder writes drum events in order. Drums lock in the rhythm that other tracks can follow.
- Complete the rest with a context model: A BERT-like model looks at the whole piece and fills in the remaining notes for instruments like piano, strings, guitar, and brass (see the infilling sketch after this list).
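Here is a minimal sketch of the graph idea from the first bullet, assuming a toy five-joint skeleton. The joints, edges, and weights are illustrative, not the model's real configuration.

```python
import numpy as np

# A toy skeleton graph and one graph-convolution-style aggregation step.
JOINTS = ["shoulder", "elbow", "wrist", "knee", "ankle"]
EDGES = [(0, 1), (1, 2), (3, 4)]  # bones linking neighboring joints

# Adjacency matrix with self-loops, row-normalized.
A = np.eye(len(JOINTS))
for i, j in EDGES:
    A[i, j] = A[j, i] = 1.0
A /= A.sum(axis=1, keepdims=True)

x = np.random.rand(len(JOINTS), 2)  # per-joint (x, y) position for one frame
W = np.random.rand(2, 8)            # a learnable projection in a real model
features = A @ x @ W                # each joint mixes in its neighbors' motion
print(features.shape)               # (5, 8): one feature vector per joint
```

And a small sketch of the infilling idea from the last bullet, assuming note-event tokens and a placeholder "model"; the token names and vocabulary are invented for illustration.

```python
import random

# BERT-style infilling: keep the drum tokens, mask the rest, and let a
# context model fill the gaps. Everything here is a toy placeholder.
tokens = ["drum_kick", "piano_C4", "strings_G3", "piano_E4", "drum_snare"]
masked = [t if t.startswith("drum") else "[MASK]" for t in tokens]

def fill_masks(seq):
    """Stand-in for the context model: predict a note for each masked slot."""
    vocab = ["piano_C4", "piano_E4", "strings_G3", "guitar_A3"]
    return [random.choice(vocab) if t == "[MASK]" else t for t in seq]

print(masked)              # drum events are kept as the fixed rhythmic anchor
print(fill_masks(masked))  # the other tracks are completed in context
```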
The D2MIDI Dataset
D2MIDI is the first large, paired dance-to-MIDI dataset for many instruments. It includes only solo dancers and filters out low-quality clips. Every video is aligned with multi-instrument, polyphonic MIDI.
Clips are 30 seconds long (600 frames, or 20 frames per second). The team samples clips using a sliding window of 40 frames to reach 71,754 pairs. Styles include classical, hip-hop, ballet, modern, Latin, and house.
Each pair can include up to 13 instrument types. These include Acoustic Grand Piano, Celesta, Drawbar Organ, Acoustic Guitar (nylon), Acoustic Bass, Violin, String Ensemble 1, SynthBrass 1, Soprano Sax, Piccolo, Lead 1 (square), Pad 1 (new age), and Drum.
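As a rough illustration of that sliding-window sampling, here is a short Python sketch using the frame counts from the text; the helper name and the example video length are assumptions.

```python
# Sliding-window sampling: 600-frame (30 s) clips cut from longer videos
# with a 40-frame stride, matching the numbers in the text.
CLIP_LEN = 600   # frames per clip (30 seconds at 20 fps)
STRIDE = 40      # sliding-window step between clip starts

def sample_clips(total_frames):
    """Return (start, end) frame ranges for every full-length clip."""
    return [(s, s + CLIP_LEN)
            for s in range(0, total_frames - CLIP_LEN + 1, STRIDE)]

print(len(sample_clips(2000)))  # a 100 s video yields 36 overlapping clips
```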
Installation & Setup
Right now, the public GitHub repository lists the following project status:
- Release the D2MIDI dataset.
- Release the demo video.
- Release the main code for implementation.
- Implementation: Coming soon!
There are no install commands in the repository yet. The team plans to share the dataset and main code; check the project site and GitHub for updates.
Try It Today: Quick Steps
- Watch the demo videos on the project page to hear results in action. Compare the original audio with the generated versions.
- Note how kicks, snares, and fills match strong body moves. Listen for how piano, strings, and other parts sit on top of the beat.
- Keep an eye on the GitHub “Implementation” section for the code release and dataset drop.
FAQ
What does the system need as input?
It needs a solo dance video clip. The model reads body joints and timing to guide the music.
What format is the output?
It outputs MIDI with many instruments. You can load the MIDI in a DAW to change sounds or edit notes.
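If you want to inspect the output programmatically before opening a DAW, a small example with the pretty_midi library might look like the sketch below; the filename is a placeholder, since the code and outputs are not yet released.

```python
import pretty_midi  # pip install pretty_midi

# Inspect a generated multi-track MIDI file: list each track's instrument
# and note count. "dance2midi_output.mid" is a hypothetical filename.
midi = pretty_midi.PrettyMIDI("dance2midi_output.mid")
for track in midi.instruments:
    name = ("Drums" if track.is_drum
            else pretty_midi.program_to_instrument_name(track.program))
    print(f"{name}: {len(track.notes)} notes")
```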
Can I use it for styles other than pop?
Yes, the dataset covers classical, hip-hop, ballet, modern, Latin, and house. The demos on the website also show variety.
Is the code available today?
Not yet. The GitHub says the main code and dataset will be released. The demo videos are already online.
How long is each training clip?
Each clip is 30 seconds. This length helps the model learn steady rhythm and song structure.
Who built this project?
Authors are from Zhejiang University, National University of Singapore, ByteDance AI Lab, and Beijing Film Academy. You can find names and links on the project page.