adaptive parallel decoding

uses an absorbing-state diffusion llm in conjunction with a tiny autoregressive lm for left-to-right generation, generating multiple easy tokens in parallel and slowing down at the hard parts. frequently achieves 5x speedups over pure autoregressive text generation.
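
roughly, each decoding step drafts a block of tokens with the diffusion model in parallel, then keeps the longest prefix that both models are confident about. below is a minimal sketch of that loop (not the repo's actual code; `diffusion_probs`, `ar_probs`, and the fixed acceptance threshold are simplifying assumptions based on the paper's multiplicative-mixture idea):

```python
import torch

def apd_step(diffusion_probs, ar_probs, context, block_size=8, threshold=0.5):
    # 1. diffusion model fills a block of masked positions in one parallel pass.
    #    `diffusion_probs` is a hypothetical helper returning a
    #    (block_size, vocab) tensor of per-position token distributions.
    p_diff = diffusion_probs(context, num_masked=block_size)
    proposal = p_diff.argmax(dim=-1)  # greedy draft for the whole block

    # 2. the tiny autoregressive model scores the same draft left-to-right
    #    in a single forward pass (also a hypothetical helper here).
    p_ar = ar_probs(context, proposal)  # (block_size, vocab)

    # 3. accept the longest prefix where the multiplicative mixture of the
    #    two models' probabilities stays high: easy tokens sail through in
    #    parallel, a hard token cuts the prefix short.
    accepted = []
    for i, tok in enumerate(proposal.tolist()):
        mixed = p_diff[i, tok] * p_ar[i, tok]
        if mixed < threshold and i > 0:  # always accept at least one token
            break
        accepted.append(tok)
    return accepted
```

the adaptivity lives in that early break: long accepted prefixes on easy spans, single tokens on the hard parts.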

can load both absorbing-state (mlm) diffusion models and autoregressive (clm) models with a corresponding diffusion LoRA. make sure the diffusion model and the small autoregressive model are initialized from the same model family and share a tokenizer (except of course mask tokens and such). a good example of this is the default Dream 7B alongside Qwen2.5-0.5B.
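
for reference, loading that default pairing with `transformers` looks roughly like this (an illustrative sketch only; the model ids and `trust_remote_code` flag follow Dream's own docs, but check apd.py for the actual loading logic):

```python
# illustrative loading only; see apd.py for how the repo actually does it.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Dream ships custom modeling code, hence trust_remote_code
diffusion = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True
)
# the tiny autoregressive model from the same (Qwen2.5) family
small_ar = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# both share the Qwen2.5 vocabulary, plus Dream's mask token
tokenizer = AutoTokenizer.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B", trust_remote_code=True
)
```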

to run, just install the dependencies and run apd with whatever parameters you desire (the full list can be found at the top of main in apd.py):

```
uv sync
uv run apd.py --prompt="Please explain the Riemann hypothesis"
```

based on: https://arxiv.org/pdf/2506.00413
