DPLM-2: A Multimodal Diffusion Protein Language Model

What is DPLM-2: A Multimodal Diffusion Protein Language Model?
DPLM-2 is a protein AI model that works with both sequence and structure at the same time. It can generate protein sequences and predict 3D structures, or produce both together in a single run.

This helps researchers design proteins faster. It also supports tasks like folding (structure from sequence), inverse folding (sequence from structure), and motif scaffolding (keep a key part, design the rest).
DPLM-2: A Multimodal Diffusion Protein Language Model Overview
DPLM-2 builds on the DPLM family and adds structure modeling to the same model that learns sequences. Instead of using one model for sequence and another for structure, it handles both within a single model.
Here is a quick view of the project.
| Item | Details |
|---|---|
| Type | Multimodal protein language model (sequence + structure) |
| Purpose | Co-generate protein sequences and 3D structures in one step |
| Core Idea | Discrete diffusion model with a tokenizer that turns 3D coordinates into tokens |
| Key Tasks | Co-generation, folding, inverse folding, motif scaffolding, representation learning |
| Extra Variant | DPLM-2 Bit (bit-based structure tokens for finer control) |
| Pretrained Models | airkingbd/dplm_650m, airkingbd/dplm2_650m, airkingbd/dplm2_bit_650m |
| Training Data (DPLM-2) | PDB (experimental) + SwissProt with AF2-predicted structures |
| Codebase | ByProt-based implementation |
| Source | Project site and GitHub (training + inference + scripts) |
| Best For | Protein design teams, biology students, and ML+bio labs |
| Project Page | https://bytedance.github.io/dplm/dplm-2 |

DPLM-2: A Multimodal Diffusion Protein Language Model Key Features
- Sequence–structure co-generation: Create a protein sequence and its 3D structure together.
- Folding from sequence: Predict a structure from an input sequence, with strong results close to tools like ESMFold.
- Inverse folding: Given a target backbone, generate sequences that are likely to fold to it.
- Motif scaffolding: Keep important motifs and design the rest around them with multimodal inputs.
- Representation learning: Get structure-aware embeddings for downstream tasks.
- Controllable options: Add simple controls such as secondary structure hints.
DPLM-2: A Multimodal Diffusion Protein Language Model Use Cases
- Design new proteins with both sequence and structure in one run.
- Predict structures for candidate sequences during screening.
- Redesign sequences to match a known backbone shape.
- Build scaffolds that preserve a functional motif.
- Train simple predictors on top of learned embeddings for property tasks.
How It Works
Proteins have letters (amino acids) and shapes (3D coordinates). DPLM-2 turns both into tokens and learns how they relate.
It uses discrete diffusion to iteratively refine tokens into clean outputs. Training warm-starts from a strong sequence-only model (DPLM) and then teaches the model about structure without losing what it already learned from sequences.
DPLM-2 Bit improves structure modeling by predicting the bits of each structure code rather than a single codebook index, which makes structure generation more fine-grained.
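To build intuition for the diffusion step, here is a toy iterative-unmasking loop. It is only a sketch of the general idea, not DPLM-2's actual sampler; toy_unmask_generate and the dummy scoring model below are invented for illustration.

import torch

def toy_unmask_generate(model, mask_id, vocab_size, length=16, steps=8):
    # start from a fully masked token sequence
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(tokens)              # (1, length, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens.eq(mask_id)
        # only masked positions compete for unmasking
        conf = conf.masked_fill(~still_masked, -1.0)
        # commit a growing fraction of positions, most confident first
        k = int(length * (step + 1) / steps)
        topk = conf.topk(k, dim=-1).indices
        unmask = torch.zeros_like(still_masked)
        unmask[0, topk[0]] = True
        tokens = torch.where(unmask & still_masked, pred, tokens)
    return tokens

# tiny random "model" so the loop runs end to end
vocab_size, mask_id = 33, 32
dummy = lambda t: torch.randn(t.shape[0], t.shape[1], vocab_size)
print(toy_unmask_generate(dummy, mask_id, vocab_size))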
Getting Started: Installation & Setup
Follow these steps exactly as shown.
# clone project
git clone --recursive https://url/to/this/repo/dplm.git
cd dplm
# create conda virtual environment
env_name=dplm
conda create -n ${env_name} python=3.9 pip
conda activate ${env_name}
# automatically install everything else
bash scripts/install.sh
Load Pretrained Models
Use these lines to load checkpoints in Python.
from byprot.models.dplm import DiffusionProteinLanguageModel as DPLM
from byprot.models.dplm2 import MultimodalDiffusionProteinLanguageModel as DPLM2
from byprot.models.dplm2 import DPLM2Bit
dplm = DPLM.from_pretrained("airkingbd/dplm_650m").cuda()
dplm2 = DPLM2.from_pretrained("airkingbd/dplm2_650m").cuda()
dplm2_bit = DPLM2Bit.from_pretrained("airkingbd/dplm2_bit_650m").cuda()
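The loaded checkpoints behave like standard PyTorch modules, so for inference you can also switch them to evaluation mode with the usual call, e.g. dplm2.eval().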
Quick Examples: Generate Proteins
Here are ready-to-run examples.
Protein sequence generation:
from generate_dplm import initialize_generation
input_tokens = initialize_generation(
    length=200,
    num_seqs=5,
    tokenizer=dplm.tokenizer,
    device=next(dplm.parameters()).device
)
samples = dplm.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
print([''.join(seq.split(' ')) for seq in dplm.tokenizer.batch_decode(samples, skip_special_tokens=True)])
Protein sequence–structure co-generation (the generated sequences and structures are saved to the ./generation-results folder):
from generate_dplm2 import initialize_generation, save_results
input_tokens = initialize_generation(
    task="co_generation",
    length=200,
    num_seqs=5,
    tokenizer=dplm2.tokenizer,
    device=next(dplm2.parameters()).device
)[0]
samples = dplm2.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_generation",
    tokenizer=dplm2.tokenizer,
    struct_tokenizer=dplm2.struct_tokenizer,
    save_pdb=True,
)
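# the same input tokens can be reused with the bit-based DPLM-2 Bit variant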
samples = dplm2_bit.generate(
    input_tokens=input_tokens,
    max_iter=500,
)
save_results(
    outputs=samples,
    task="co_generation",
    save_dir="./generation-results/dplm2_bit_generation",
    tokenizer=dplm2_bit.tokenizer,
    struct_tokenizer=dplm2_bit.struct_tokenizer
)
Training Guide
Download datasets for faster local loading.
DPLM (UniRef50):
bash scripts/download_uniref50_hf.sh
DPLM-2 (PDB + SwissProt with AF2 structures):
bash scripts/download_pdb_swissprot.sh
Train DPLM (sequence-only). This runs on one node with 8 A100 GPUs:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
max_tokens=8192
accumulate_grad_batches=16
# effective batch size = #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(16), i.e. approximately 1 million tokens per step
exp=dplm/dplm_650m
model_name=dplm_650m
python train.py \
experiment=${exp} name=${model_name} \
datamodule.max_tokens=${max_tokens} \
trainer.accumulate_grad_batches=${accumulate_grad_batches}
Train DPLM-2 (sequence + structure) with warm-up and LoRA:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
max_tokens=8192
accumulate_grad_batches=1
# effective batch size = #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(1), i.e. approximately 64 thousand tokens per step
exp=dplm2/dplm2_650m
model_name=dplm2_650m
python train.py \
experiment=${exp} name=${model_name} \
datamodule.max_tokens=${max_tokens} \
trainer.accumulate_grad_batches=${accumulate_grad_batches}
Train DPLM-2 Bit (bit-based structure tokens):
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
max_tokens=8192
accumulate_grad_batches=1
# effective batch size = #GPUs(8) * max_tokens(8192) * accumulate_grad_batches(1), i.e. approximately 64 thousand tokens per step
exp=dplm2/dplm2_bit_650m
model_name=dplm2_bit_650m
python train.py \
experiment=${exp} name=${model_name} \
datamodule.max_tokens=${max_tokens} \
trainer.accumulate_grad_batches=${accumulate_grad_batches}
Unconditional Generation from CLI
Generate sequences with a pre-trained DPLM model and score with ESMFold.
model_name=dplm_650m # choose from dplm_150m, dplm_650m, dplm_3b
output_dir=generation-results/${model_name}/uncond_generation
mkdir -p generation-results
python generate_dplm.py --model_name airkingbd/${model_name} \
--seq_lens 100 200 300 400 500 \
--saveto ${output_dir}
# Evaluation
bash analysis/plddt_calculate.sh ${output_dir} # compute pLDDT using ESMFold
Performance & Showcases
DPLM-2 co-generates sequences and structures within a single model, removing the need for a two-stage pipeline (for example, generating a backbone first and then running inverse folding). It shows strong results across folding, inverse folding, and motif scaffolding.
In folding from sequence, reported results are close to strong baselines such as ESMFold. In inverse folding, it produces sequences that are likely to fold into the target backbone.
DPLM-2 Bit further improves structure modeling by predicting the bits of the structure code, which helps capture fine structural detail. The default sampling setting was later tuned to “annealing@2.0:0.1” for better results.
Tips and Notes
- Hardware: the examples assume 8x A100 GPUs. On smaller setups, adjust max_tokens and accumulate_grad_batches so the effective tokens per step stay in the same range (see the sketch after this list).
- Data: use the provided download scripts to cache datasets locally. This speeds up training.
- Results: co-generation scripts can save PDB files to the generation-results folder for quick checks.
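For instance, here is a minimal sketch of keeping roughly the same token budget on a 4-GPU machine; the flags mirror the DPLM training command above, and the exact values are illustrative rather than an official recommendation.

export CUDA_VISIBLE_DEVICES=0,1,2,3
max_tokens=8192
accumulate_grad_batches=32  # 4 GPUs * 8192 * 32 ≈ 1 million tokens per step, matching the 8-GPU setup
python train.py \
    experiment=dplm/dplm_650m name=dplm_650m \
    datamodule.max_tokens=${max_tokens} \
    trainer.accumulate_grad_batches=${accumulate_grad_batches}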
FAQ
What does “multimodal” mean here?
It means the model reads and writes both protein letters (sequence) and 3D points (structure). DPLM-2 learns how the two connect.
Do I need structure data to start?
For DPLM-2 training, yes. Use the PDB and SwissProt (with AF2-predicted structures) datasets via the provided download script. For sequence-only work, DPLM is enough.
How big are the models?
The repo mentions DPLM sizes of 150M, 650M, and 3B. The DPLM-2 examples in the scripts use the 650M checkpoints (dplm2_650m and dplm2_bit_650m).
What files do the generation scripts save?
They write sequences and, for DPLM-2, structures to the generation-results folder. You can also have the script save PDB files by passing save_pdb=True.