A large language model inference and chat service framework designed for Enflame GCU, built on top of Candle-GCU and the open-source project Candle-vLLM, and fully compatible with the OpenAI API.
English | 简体中文
```bash
# Install Rust (version 1.88.0 or higher)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install required system dependencies
sudo apt install libssl-dev pkg-config -y

# Install Enflame's driver and runtime
sudo ./TopsPlatform_1.4.5.xxxx.run
dpkg -i eccl_3.5.xxx_amd64.deb

# Install bindgen
cargo install bindgen-cli

# Update submodules
git submodule update --init --recursive
cd candle-vllm

# Build for a single-node setup
cargo build --release --features gcu,eccl

# Build with multi-node support (MPI)
sudo apt update
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y
cargo build --release --features gcu,eccl,mpi
```

- ✅ Multi-rank (Multi-GPUs, Multi-Nodes)
- ✅ Quantization (GPTQ, AWQ)
- ✅ Continuous Batching
- ✅ Paged Attention
- ✅ Chunked Prefill
- ✅ KV Cache
- ✅ BF16
- ✅ FP16
- ✅ INT8
- ✅ OpenAI-Compatible Server
- ✅ Multimodal Models
- 🛠️ CUDA Graph (Under Development)
Usage:

```
[ENV_PARAM] cargo run [BUILD_PARAM] -- [PROGRAM_PARAM] [MODEL_ID/MODEL_WEIGHT_PATH]
```

Example:

```bash
[RUST_LOG=warn] cargo run [--release --features gcu,eccl] -- [--log --dtype bf16 --p 2000 --d 0,1 --mem 8192] [--w /home/weights/QwQ-32B/]
```

- `ENV_PARAM`: `RUST_LOG=warn`
- `BUILD_PARAM`: `--release --features gcu,eccl`
- `PROGRAM_PARAM`: `--log --dtype bf16 --p 2000 --d 0,1 --mem 8192`
- `MODEL_WEIGHT_PATH`: `--w /home/weights/QwQ-32B`

where:

- `--p`: server port
- `--d`: device ids
- `--w`: weight path (safetensors folder)
- `--f`: weight file (for GGUF)
- `--m`: Hugging Face model-id
- `--mem`: the key parameter controlling KV cache usage (increase it for large batches)
- `--prefill-chunk-size`: chunk the prefill into pieces of this size (default 8K, `0` to disable)
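To get a feel for how large `--mem` should be, it helps to estimate how much KV cache a single token consumes. The sketch below is a back-of-the-envelope calculation, not something read out of candle-vllm: the LLaMa3.1-8B configuration values are public, but the interpretation of `--mem` as a KV-cache budget in MB is only an assumption inferred from the example value above.

```python
# Back-of-the-envelope KV-cache sizing.
# Assumption: --mem is a KV-cache budget in MB (inferred from the example value 8192).
# Model config below is LLaMa3.1-8B with GQA: 32 layers, 8 KV heads, head_dim 128.
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2  # bf16

# Both K and V are cached for every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

mem_budget_mb = 8192  # the --mem value used in the example launch
tokens_in_budget = mem_budget_mb * 1024 * 1024 // kv_bytes_per_token
print(f"~{tokens_in_budget} cached tokens fit in the budget")
# ~65k tokens total, shared across all concurrent sequences; e.g. a batch of
# 16 sequences could each hold roughly 4k tokens of context.
```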
🔷 DeepSeek-R1 685B (AWQ, ~8 tokens/s, 8 x Enflame S60, offloaded ~120GB to CPU)
🔷 LLaMa3.1 8B (AWQ, ~40 tokens/s, 1 x Enflame S60)
Currently supported models on Enflame S60 (48GB):
Decoding results with 1K output tokens:
| Model ID | Model Type | Supported | Speed (BF16, bs=1) | Throughput (BF16, bs=16) | Throughput (W4A16) |
|---|---|---|---|---|---|
| #1 | LLAMA | ✅ | 30 tks/s (7B), 27 tks/s (LLaMa3.1 8B) | 375 tks/s (LLaMa3.1 8B) | 41 tks/s (bs=1), 1185 tks/s (bs=48) |
| #2 | Mistral | ✅ | 29 tks/s (7B) | 330 tks/s (7B) | TBD |
| #3 | Phi (v1, v1.5, v2) | ✅ | TBD | TBD | TBD |
| #4 | Phi-3 | ✅ | 38 tks/s (3.8B) | 320 tks/s (BF16+F32, 7B) | TBD |
| #5 | Yi | ✅ | 28 tks/s (6B) | 305 tks/s (6B) | TBD |
| #6 | StableLM | ✅ | 48 tks/s (3B) | 425 tks/s (BF16, 3B) | TBD |
| #7 | BigCode/StarCode | TBD | TBD | TBD | |
| #8 | ChatGLM | TBD | TBD | TBD | |
| #9 | QWen2 | ✅ | 22 tks/s (14B, tp=2) | 322 tks/s (14B, tp=2, bs=32) | TBD |
| #10 | Qwen3 | ✅ | 23 tks/s (8B, bs=1) | 607 tks/s (14B, bs=48) | TBD |
| #11 | Google Gemma | ✅ | 51 tks/s (2B) | 577 tks/s (2B) | TBD |
| #12 | GLM4 | ✅ | TBD | TBD | |
| #13 | Moondream-2 (Multimodal LLM) | TBD | TBD | TBD | |
| #14 | DeepSeek-V3/R1 (AWQ 671/685B, offloading) | ✅ | ~8 tks/s (tp=8) | 155 tks/s (tp=8, bs=48) | TBD |
| #15 | QwQ-32B | ✅ | 10.6 tks/s (tp=2) | 214 tks/s (tp=2, bs=32) | TBD |
### Run Uncompressed Models

```bash
target/release/candle-vllm --p 2000 --w /home/DeepSeek-R1-Distill-Llama-8B/
```
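Once the server is up, a quick way to check it from Python is to list the served model through the OpenAI-compatible endpoint. A minimal sketch, assuming the server was launched locally with `--p 2000`; the API key value is just a placeholder:

```python
# Minimal health check against the OpenAI-compatible endpoint (port from --p).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")  # key is a placeholder
for model in client.models.list().data:
    print(model.id)  # prints the model id(s) the server exposes
```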
### Run GPTQ Quantized Models

```bash
# convert an 8-bit GPTQ model to the Enflame format
python3 transform_safetensors.py --src /path/to/gptq \
    --dst /path/to/gptq-enflame --bits 8 --method gptq --group 128 --nk True

# run the converted model
target/release/candle-vllm --dtype bf16 --p 2000 --w /path/to/gptq-enflame
```
### Run AWQ Quantized Models

```bash
# convert a 4-bit AWQ model to the Enflame format
python3 transform_safetensors.py --src /path/to/awq \
    --dst /path/to/awq-enflame --bits 4 --method awq --group 64 --nk True

# run the converted model
target/release/candle-vllm --dtype f16 --p 2000 --w /path/to/awq-enflame
```
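Before launching a converted model, it can be worth sanity-checking the output of `transform_safetensors.py`. The sketch below only uses the generic `safetensors` Python API to list tensor names and shapes without loading them; the file name is a placeholder, and no particular tensor layout from the converter is assumed.

```python
# Inspect a converted checkpoint without materializing the tensors.
# "model.safetensors" is a placeholder; converted folders may contain several shards.
from safetensors import safe_open

with safe_open("/path/to/awq-enflame/model.safetensors", framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```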
### Multi-Process, Multi-GPU

```bash
# Use card 0 and card 1
target/release/candle-vllm --p 2000 --d 0,1 --weight-path /path/to/model
```
### Multi-Node (MPI) Setup

```bash
# Install MPI
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y

# Build
cargo build --release --features gcu,eccl,mpi

# Launch via mpirun (make sure the model weights and the candle-vllm binary
# are located at the same paths on every machine)
sudo mpirun -np 16 -x RUST_LOG=info -hostfile ./hostfile \
  --allow-run-as-root -bind-to none -map-by slot \
  --mca btl_tcp_if_include %NET_INTERFACE% \
  target/release/candle-vllm --dtype bf16 --p 2000 \
  --d 0,1,2,3,4,5,6,7 --w /data/deepseek-enflame
```
### Chat Frontends

```bash
pip install openai rich click
python3 examples/chat.py
python3 examples/chat.py --live  # with markdown support
```
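`examples/chat.py` talks to the server through the OpenAI-compatible API; the snippet below is a reduced sketch of the same idea with streaming output, not a copy of the script. The host, port, and model name `"default"` are placeholders.

```python
# Streaming chat sketch against the OpenAI-compatible server (placeholder host/model).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",  # placeholder; candle-vllm serves the model it was launched with
    messages=[{"role": "user", "content": "Explain paged attention in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```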
```bash
# install Rust aichat
cargo install aichat
aichat --serve
# select `openai-compatible` and provide the name `candle-vllm`
# paste the candle-vllm API base url, e.g. http://0.0.0.0:2000/v1/ (API Key: empty, LLMs to include: default)
# click the "LLM Playground" url
```

Demo video: DeepSeek-Distill-LLaMa8B-BF16.mp4
Run batched benchmark tests:

```bash
python3 examples/benchmark.py --batch 16 --max_tokens 1024
```

Refer to the `benchmark.py` script for an async chat example.
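For a rough idea of what an async, batched client looks like (the authoritative logic lives in `examples/benchmark.py`), here is a hedged sketch using `AsyncOpenAI` and `asyncio.gather`; the batch size, prompts, and model name are illustrative only.

```python
# Async batched-request sketch; examples/benchmark.py is the authoritative version.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder model name
        messages=[{"role": "user", "content": f"Write a haiku about request {i}."}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main(batch: int = 16) -> None:
    # Fire the whole batch concurrently; continuous batching on the server
    # interleaves the decode steps.
    outputs = await asyncio.gather(*(one_request(i) for i in range(batch)))
    for out in outputs:
        print(out, "\n---")

asyncio.run(main())
```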
- Use `transform_safetensors.py` to convert models.
- Samples:
```bash
# 8-bit GPTQ conversion
python3 transform_safetensors.py --src /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit --dst /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame --bits 8 --method gptq --group 128 --nk True

# 4-bit AWQ conversion
python3 transform_safetensors.py --src /data/DeepSeek-R1-AWQ --dst /data/DeepSeek-R1-AWQ-Enflame/ --bits 4 --method awq --group 64 --nk True

# run the converted model
cargo run --release --features gcu -- --p 2000 \
  --w /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame
```

- Add GGUF model support (e.g., `q4_k` quantization).
- Extend support to multimodal models.