
๐Ÿ•ฏ๏ธ Candle-vLLM-GCU

A large language model inference and chat service framework designed for Enflame GCU, built on top of Candle-GCU and the open-source project Candle-vLLM, and fully compatible with the OpenAI API.


English | 简体中文 (Simplified Chinese)

🚀 Getting Started

🔧 Build Candle-vLLM-GCU

# Install Rust (version 1.88.0 or higher)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install required system dependencies
sudo apt install libssl-dev pkg-config -y

# Install Enflame's drivers and runtime
sudo ./TopsPlatform_1.4.5.xxxx.run
sudo dpkg -i eccl_3.5.xxx_amd64.deb

# Install bindgen
cargo install bindgen-cli

# Initialize and update submodules
git submodule update --init --recursive
cd candle-vllm

# Build for single-node setup
cargo build --release --features gcu,eccl

# Build for multi-node support (MPI)
sudo apt update
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y
cargo build --release --features gcu,eccl,mpi

✅ Supported Features

  • ✅ Multi-rank (multi-GPU, multi-node)
  • ✅ Quantization (GPTQ, AWQ)
  • ✅ Continuous Batching
  • ✅ Paged Attention
  • ✅ Chunked Prefill
  • ✅ KV Cache
    • ✅ BF16
    • ✅ FP16
    • ❌ INT8
  • ✅ OpenAI-Compatible Server
  • ❌ Multimodal Models
  • 🛠️ CUDA Graph (under development)

โš™๏ธ Build and Running Parameters

  • [ENV_PARAM] cargo run [BUILD_PARAM] -- [PROGRAM_PARAM] [MODEL_ID/MODEL_WEIGHT_PATH]

    Example:

    [RUST_LOG=warn] cargo run [--release --features gcu,eccl] -- [--log --dtype bf16 --p 2000 --d 0,1 --mem 8192] [--w /home/weights/QwQ-32B/]

    ENV_PARAM: RUST_LOG=warn

    BUILD_PARAM: --release --features gcu,eccl

    PROGRAM_PARAM: --log --dtype bf16 --p 2000 --d 0,1 --mem 8192

    MODEL_WEIGHT_PATH: --w /home/weights/QwQ-32B

    where --p is the server port; --d the device IDs; --w the weight path (a folder of safetensors files); --f a weight file (for GGUF); --m a Hugging Face model ID; --mem the key parameter controlling KV-cache usage (increase it for large batches); and --prefill-chunk-size the chunk size used to split the prefill (default 8K; 0 disables chunking). Once launched, the server exposes an OpenAI-compatible API, as sketched below.
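
A minimal Python sketch for querying the running server with the official openai client (installed later via pip install openai). The base URL/port, empty API key, model label, and prompt are illustrative assumptions; candle-vllm serves one model at a time, so the model field is treated here as a free-form label.

from openai import OpenAI

# Point the OpenAI client at the local candle-vllm server (port set with --p above).
client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")  # no real key needed (assumption)

resp = client.chat.completions.create(
    model="candle-vllm",  # hypothetical label; the served model is selected by --w/--m at launch
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)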


🎥 Demo Chat Videos

🔷 DeepSeek-R1 685B (AWQ, ~8 tokens/s, 8 x Enflame S60, ~120 GB offloaded to CPU)

🔷 LLaMa3.1 8B (AWQ, ~40 tokens/s, 1 x Enflame S60)


📊 Model Support & Performance

Currently supported models on Enflame S60 (48GB):

Results for 1K-token decoding:

Model ID | Model Type | Supported | Speed (BF16, bs=1) | Throughput (BF16, bs=16) | Throughput (W4A16)
#1 | LLaMA | ✅ | 30 tks/s (7B), 27 tks/s (LLaMa3.1 8B) | 375 tks/s (LLaMa3.1 8B) | 41 tks/s (bs=1), 1185 tks/s (bs=48)
#2 | Mistral | ✅ | 29 tks/s (7B) | 330 tks/s (7B) | TBD
#3 | Phi (v1, v1.5, v2) | ✅ | TBD | TBD | TBD
#4 | Phi-3 | ✅ | 38 tks/s (3.8B) | 320 tks/s (BF16+F32, 7B) | TBD
#5 | Yi | ✅ | 28 tks/s (6B) | 305 tks/s (6B) | TBD
#6 | StableLM | ✅ | 48 tks/s (3B) | 425 tks/s (BF16, 3B) | TBD
#7 | BigCode/StarCode | TBD | TBD | TBD | TBD
#8 | ChatGLM | TBD | TBD | TBD | TBD
#9 | QWen2 | ✅ | 22 tks/s (14B, tp=2) | 322 tks/s (14B, tp=2, bs=32) | TBD
#9 | Qwen3 | ✅ | 23 tks/s (8B, bs=1) | 607 tks/s (14B, bs=48) | TBD
#10 | Google Gemma | ✅ | 51 tks/s (2B) | 577 tks/s (2B) | TBD
#11 | GLM4 | ✅ | TBD | TBD | TBD
#12 | Moondream-2 (multimodal LLM) | TBD | TBD | TBD | TBD
#13 | DeepSeek-V3/R1 (AWQ 671/685B, offloading) | ✅ | ~8 tks/s (tp=8) | 155 tks/s (tp=8, bs=48) | TBD
#14 | QwQ-32B | ✅ | 10.6 tks/s (tp=2) | 214 tks/s (tp=2, bs=32) | TBD

💡 Usage Examples

Run Uncompressed Models

target/release/candle-vllm --p 2000 --w /home/DeepSeek-R1-Distill-Llama-8B/

Run GPTQ Quantized Models

# convert an 8-bit GPTQ model to Enflame format
python3 transform_safetensors.py --src /path/to/gptq \
    --dst /path/to/gptq-enflame --bits 8 --method gptq --group 128 --nk True

# run the converted model
target/release/candle-vllm --dtype bf16 --p 2000 --w /path/to/gptq-enflame

Run AWQ Quantized Models

# convert a 4-bit AWQ model to Enflame format
python3 transform_safetensors.py --src /path/to/awq \
    --dst /path/to/awq-enflame --bits 4 --method awq --group 64 --nk True

# run the converted model
target/release/candle-vllm --dtype f16 --p 2000 --w /path/to/awq-enflame

๐Ÿ–ฅ๏ธ Multi-GPU & Multi-Node Inference

Multi-Process, Multi-GPU

# Use card 0 and card 1
target/release/candle-vllm --p 2000 --d 0,1 --weight-path /path/to/model

Multi-Node (MPI) Setup

# Install MPI
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y

# Build
cargo build --release --features gcu,eccl,mpi

# Launch via mpirun (the model weights and the candle-vllm binary must be located
# at the same paths on every machine)
sudo mpirun -np 16 -x RUST_LOG=info -hostfile ./hostfile \
    --allow-run-as-root -bind-to none -map-by slot \
    --mca btl_tcp_if_include %NET_INTERFACE% \
    target/release/candle-vllm --dtype bf16 --p 2000 \
    --d 0,1,2,3,4,5,6,7 --w /data/deepseek-enflame
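
The -hostfile argument expects a standard Open MPI host file listing one node per line with its slot count. A minimal sketch matching the 16-rank launch above, assuming two machines with 8 GCUs each (the hostnames node1/node2 are placeholders):

# hostfile (Open MPI format): one line per node, slots = ranks launched on that node
node1 slots=8
node2 slots=8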

💬 Chat Frontends

Option 1: Quick Test via chat.py

pip install openai rich click
python3 examples/chat.py
python3 examples/chat.py --live # with markdown support
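
If you prefer to script a frontend yourself instead of using chat.py, a minimal streaming sketch with the openai Python client looks like this (the base URL and model label are assumptions, as above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    model="candle-vllm",  # hypothetical label
    messages=[{"role": "user", "content": "Tell me a short story about a candle."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()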

Option 2: Chat UI with history

# install Rust aichat
cargo install aichat

aichat --serve
# select `openai-compatible`, provide the name `candle-vllm`
# paste the candle-vllm API base URL, e.g. http://0.0.0.0:2000/v1/ (API Key: empty, LLMs to include: default)
# open the "LLM Playground" URL

📈 Benchmarking

Run batched benchmark tests:

python3 examples/benchmark.py --batch 16 --max_tokens 1024

Refer to the benchmark.py script for an async chat example; a minimal sketch of the same idea follows.
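
This sketch issues several chat requests concurrently with the openai client's async interface so the server can batch them; it is not the benchmark.py implementation, and the base URL, model label, and prompts are placeholder assumptions.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

async def one_chat(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="candle-vllm",  # hypothetical label
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Write two sentences about topic #{i}." for i in range(16)]
    # Concurrent requests let the server's continuous batching kick in.
    outputs = await asyncio.gather(*(one_chat(p) for p in prompts))
    print(f"finished {len(outputs)} requests")

asyncio.run(main())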


📦 Quantization to Enflame Format

  1. Use transform_safetensors.py to convert models.
  2. Samples:
# 8-bit GPTQ conversion
python3 transform_safetensors.py --src /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit --dst /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame --bits 8 --method gptq --group 128 --nk True

# 4-bit AWQ conversion
python3 transform_safetensors.py --src /data/DeepSeek-R1-AWQ --dst /data/DeepSeek-R1-AWQ-Enflame/ --bits 4 --method awq --group 64 --nk True

# run the converted model
cargo run --release --features gcu -- --p 2000 \
--w /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame
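
As a quick sanity check after conversion, the tensors in the destination folder can be listed with the safetensors Python package; a minimal sketch, assuming the GPTQ output directory from the example above (the path and shard layout are illustrative):

from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("/data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame")

# Print the tensor count per shard plus a few names/shapes to confirm the
# converted (quantized) weights are present and readable.
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(shard, framework="numpy") as f:
        names = list(f.keys())
        print(f"{shard.name}: {len(names)} tensors")
        for name in names[:5]:
            print("  ", name, f.get_slice(name).get_shape())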

๐Ÿ› ๏ธ TODO

  • Add GGUF model support (e.g., q4_k quantization).
  • Extend support to multimodal models.
