
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

Project Page | arXiv | Hugging Face

We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner. Unlike existing methods that rely on low-level representations with limited compression rates, DiCoDe uses deep tokens with a substantial compression rate (a 1000x reduction in token count). This compression is made possible by a tokenizer trained by leveraging the prior knowledge of video diffusion models. Deep tokens allow DiCoDe to employ vanilla AR language models for video generation, akin to translating one visual "language" into another. By treating videos as temporal sequences, DiCoDe fully harnesses the capabilities of language models for autoregressive generation. DiCoDe scales with readily available AR architectures and can generate videos ranging from a few seconds to one minute while training on only 4 A100 GPUs.
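To get an intuition for the 1000x token reduction, here is a back-of-the-envelope calculation. The frame count, resolution, and patch size below are illustrative assumptions for arithmetic only, not the paper's actual configuration:

```python
# Back-of-the-envelope illustration of the ~1000x token reduction.
# Frame count, resolution, and patch size are assumptions for arithmetic only.
frames = 16                                  # frames in a short clip
patch_tokens_per_frame = (256 // 16) ** 2    # 16x16 patches on a 256x256 frame
low_level_tokens = frames * patch_tokens_per_frame
deep_tokens = low_level_tokens // 1000       # ~1000x compression

print(low_level_tokens)  # 4096
print(deep_tokens)       # 4
```

A few thousand low-level patch tokens per clip collapse to a handful of deep tokens, which is what makes vanilla AR language models practical for video.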

Framework

DiCoDe introduces a two-stage approach for video generation:

  1. Tokenization Stage: A video diffusion model serves as a tokenizer to extract highly-compressed deep tokens from videos. This reduces token count by 1000x compared to traditional methods.

  2. Generation Stage: An autoregressive language model predicts sequences of deep tokens conditioned on text prompts, enabling efficient video generation.
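The two stages above can be sketched as the following pipeline. Note that `DiffusionTokenizer` and `ARLanguageModel` are hypothetical stand-ins for illustration, not the repository's actual API:

```python
# Minimal sketch of the two-stage DiCoDe pipeline.
# DiffusionTokenizer and ARLanguageModel are hypothetical placeholders.

class DiffusionTokenizer:
    """Stage 1: compress a clip into a handful of deep tokens."""
    def encode(self, frame_tokens):
        # A real tokenizer distills video-diffusion priors; here we keep
        # every 1000th token to mimic the compression rate.
        return frame_tokens[::1000] or frame_tokens[:1]

    def decode(self, deep_tokens):
        # A real decoder is a video diffusion model conditioned on deep tokens.
        return deep_tokens

class ARLanguageModel:
    """Stage 2: predict the next deep token from text + token history."""
    def predict(self, prompt, history):
        return [f"token({prompt}, t={len(history)})"]

def generate_video(prompt, num_chunks=3):
    tokenizer, lm = DiffusionTokenizer(), ARLanguageModel()
    deep_tokens = []
    for _ in range(num_chunks):  # autoregressive over the temporal sequence
        deep_tokens += lm.predict(prompt, deep_tokens)
    return tokenizer.decode(deep_tokens)

print(generate_video("a hot air balloon"))
```

The key design choice this sketch captures is that the language model only ever sees short sequences of deep tokens, so any off-the-shelf AR architecture can drive generation.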

(Figure: overview of the two-stage DiCoDe framework)

Showcase

Tokenization

(Side-by-side comparisons of original videos and their reconstructions)

Short Video Generation

- A hot air balloon animation
- A dramatic oil painting of a stormy ocean
- A drone flying over a coastal town
- A time-lapse of clouds moving across a blue sky
- A time-lapse of a city skyline transitioning from day to night
- A bird taking a break on a sturdy fence post
- An elderly couple walking hand in hand, surrounded by a sunset's glow
- A black and white photograph of an old train traveling through the countryside

Long Video Generation

- A close-up of a butterfly landing on a flower
- A single candle burning brightly in the dark
- A stunning oil painting depicting a stormy sea with waves crashing dramatically
- A time-lapse of clouds drifting across a blue sky

Getting Started

Installation

# Clone the repository
git clone https://github.com/liyz15/DiCoDe.git
cd DiCoDe

# Install dependencies
pip install -r requirements.txt

Download Pre-trained Models

You can download OpenCLIP ViT-H/14 here and pre-trained DiCoDe models from Hugging Face.

For the tokenizer demo, you will need pytorch_model.bin. For the LLM demo, you will need gpt2_large.bin and gpt2-large/.

Place or link all models under the models directory; the final layout should look like:

- models/
  - open_clip_pytorch_model.bin
  - pytorch_model.bin
  - gpt2_large.bin
  - gpt2-large/
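A quick sanity check for the layout above can save a failed demo run. This helper is a convenience sketch, not part of the repository:

```python
from pathlib import Path

# Files and directories expected under models/ per the layout above.
EXPECTED = [
    "open_clip_pytorch_model.bin",
    "pytorch_model.bin",
    "gpt2_large.bin",
    "gpt2-large",
]

def missing_models(models_dir="models"):
    """Return the expected entries that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_models()
    print("All models in place." if not missing else f"Missing: {missing}")
```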

Inference

Tokenizer Demo

We provide a simple demo script to demonstrate the tokenization process. The script takes a video as input and outputs a reconstructed version using our diffusion-based tokenizer model.

# Run the tokenization demo with default settings
python demo_tokenization.py

# Run with custom input video
python demo_tokenization.py --input_video path/to/your/video.mp4 --output_dir results/

The demo extracts the first and last frames from the input video and uses them to reconstruct the original video, demonstrating the fidelity of the compression and reconstruction.
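The conditioning step described above amounts to selecting two keyframes. A minimal sketch, assuming `frames` is any sequence of decoded frames (e.g. arrays from a video reader); this is not the demo script's actual code:

```python
def keyframes(frames):
    """Return the (first, last) frames the reconstruction is conditioned on.

    `frames` is any sequence of decoded frames; in practice these would come
    from a video reader, but any sequence works for illustration.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("video has no frames")
    return frames[0], frames[-1]

# Example with dummy "frames":
first, last = keyframes(["frame0", "frame1", "frame2"])
print(first, last)  # frame0 frame2
```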

Autoregressive Language Model Demo

This demo allows you to generate videos using text prompts. The language model predicts deep tokens autoregressively based on your input prompt.

# Run the LLM-based video generation demo
python demo_llm.py --prompt "Your text prompt here"

Citation

If you find this work helpful, please consider citing:

@article{li2024dicode,
  title={DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models},
  author={Li, Yizhuo and Ge, Yuying and Ge, Yixiao and Luo, Ping and Shan, Ying},
  journal={arXiv preprint arXiv:2412.04446},
  year={2024}
}

Acknowledgement

This project builds upon many outstanding prior research efforts. Thanks for their great work!
