A large language model inference and chat service framework designed for Enflame GCU, built on top of Candle-GCU and the open-source project Candle-vLLM, and fully compatible with the OpenAI API.
English | 简体中文
```bash
# Install Rust (version 1.88.0 or higher)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install required system dependencies
sudo apt install libssl-dev pkg-config -y

# Install Enflame's driver and runtime
sudo ./TopsPlatform_1.4.5.xxxx.run
dpkg -i eccl_3.5.xxx_amd64.deb

# Install bindgen
cargo install bindgen-cli

# Update submodules
git submodule update --init --recursive
cd candle-vllm

# Build for a single-node setup
cargo build --release --features gcu,eccl

# Build with multi-node support (MPI)
sudo apt update
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y
cargo build --release --features gcu,eccl,mpi
```

- ✅ Multi-rank (Multi-GPUs, Multi-Nodes)
- ✅ Quantization (GPTQ, AWQ)
- ✅ Continuous Batching
- ✅ Paged Attention
- ✅ Chunked Prefill
- ✅ KV Cache
- ✅ BF16
- ✅ FP16
- ✅ INT8
- ✅ OpenAI-Compatible Server
- ✅ Multimodal Models
- 🛠️ CUDA Graph (Under Development)
Usage:

```
[ENV_PARAM] cargo run [BUILD_PARAM] -- [PROGRAM_PARAM] [MODEL_ID/MODEL_WEIGHT_PATH]
```

Example:

```bash
[RUST_LOG=warn] cargo run [--release --features gcu,eccl] -- [--log --dtype bf16 --p 2000 --d 0,1 --mem 8192] [--w /home/weights/QwQ-32B/]
```

- `ENV_PARAM`: `RUST_LOG=warn`
- `BUILD_PARAM`: `--release --features gcu,eccl`
- `PROGRAM_PARAM`: `--log --dtype bf16 --p 2000 --d 0,1 --mem 8192`
- `MODEL_WEIGHT_PATH`: `--w /home/weights/QwQ-32B`

where:

- `--p`: server port
- `--d`: device ids
- `--w`: weight path (safetensors folder)
- `--f`: weight file (for GGUF)
- `--m`: Hugging Face model-id
- `--mem`: the key parameter controlling KV cache usage (increase it for large batches)
- `--prefill-chunk-size`: chunk the prefill into pieces of this size (default 8K, `0` to disable)
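To get a feel for how large `--mem` should be, it helps to estimate how much KV cache a single token consumes. The sketch below is a back-of-the-envelope calculation, not something read out of candle-vllm: the LLaMa3.1-8B configuration values are public, but the interpretation of `--mem` as a KV-cache budget in MB is only an assumption inferred from the example value above.

```python
# Back-of-the-envelope KV-cache sizing.
# Assumption: --mem is a KV-cache budget in MB (inferred from the example value 8192).
# Model config below is LLaMa3.1-8B with GQA: 32 layers, 8 KV heads, head_dim 128.
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2  # bf16

# Both K and V are cached for every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB

mem_budget_mb = 8192  # the --mem value used in the example launch
tokens_in_budget = mem_budget_mb * 1024 * 1024 // kv_bytes_per_token
print(f"~{tokens_in_budget} cached tokens fit in the budget")
# ~65k tokens total, shared across all concurrent sequences; e.g. a batch of
# 16 sequences could each hold roughly 4k tokens of context.
```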
🔷 DeepSeek-R1 685B (AWQ, ~8 tokens/s, 8 x Enflame S60, offloaded ~120GB to CPU)
🔷 LLaMa3.1 8B (AWQ, ~40 tokens/s, 1 x Enflame S60)
Currently supported models on Enflame S60 (48GB):
Decoding results with 1K output tokens:
| Model ID | Model Type | Supported | Speed (BF16, bs=1) | Throughput (BF16, bs=16) | Throughput (W4A16) |
|---|---|---|---|---|---|
| #1 | LLAMA | ✅ | 30 tks/s (7B), 27 tks/s (LLaMa3.1 8B) | 375 tks/s (LLaMa3.1 8B) | 41 tks/s (bs=1), 1185 tks/s (bs=48) |
| #2 | Mistral | ✅ | 29 tks/s (7B) | 330 tks/s (7B) | TBD |
| #3 | Phi (v1, v1.5, v2) | ✅ | TBD | TBD | TBD |
| #4 | Phi-3 | ✅ | 38 tks/s (3.8B) | 320 tks/s (BF16+F32, 7B) | TBD |
| #5 | Yi | ✅ | 28 tks/s (6B) | 305 tks/s (6B) | TBD |
| #6 | StableLM | ✅ | 48 tks/s (3B) | 425 tks/s (BF16, 3B) | TBD |
| #7 | BigCode/StarCode | TBD | TBD | TBD | |
| #8 | ChatGLM | TBD | TBD | TBD | |
| #9 | QWen2 | ✅ | 22 tks/s (14B, tp=2) | 322 tks/s (14B, tp=2, bs=32) | TBD |
| #10 | Qwen3 | ✅ | 23 tks/s (8B, bs=1) | 607 tks/s (14B, bs=48) | TBD |
| #11 | Google Gemma | ✅ | 51 tks/s (2B) | 577 tks/s (2B) | TBD |
| #12 | GLM4 | ✅ | TBD | TBD | |
| #13 | Moondream-2 (Multimodal LLM) | TBD | TBD | TBD | |
| #14 | DeepSeek-V3/R1 (AWQ 671/685B, offloading) | ✅ | ~8 tks/s (tp=8) | 155 tks/s (tp=8, bs=48) | TBD |
| #15 | QwQ-32B | ✅ | 10.6 tks/s (tp=2) | 214 tks/s (tp=2, bs=32) | TBD |
### Run Uncompressed Models

```bash
target/release/candle-vllm --p 2000 --w /home/DeepSeek-R1-Distill-Llama-8B/
```
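Once the server is up, a quick way to check it from Python is to list the served model through the OpenAI-compatible endpoint. A minimal sketch, assuming the server was launched locally with `--p 2000`; the API key value is just a placeholder:

```python
# Minimal health check against the OpenAI-compatible endpoint (port from --p).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")  # key is a placeholder
for model in client.models.list().data:
    print(model.id)  # prints the model id(s) the server exposes
```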
### Run GPTQ Quantized Models

```bash
# convert an 8-bit GPTQ model to the Enflame format
python3 transform_safetensors.py --src /path/to/gptq \
    --dst /path/to/gptq-enflame --bits 8 --method gptq --group 128 --nk True

# run the converted model
target/release/candle-vllm --dtype bf16 --p 2000 --w /path/to/gptq-enflame
```
### Run AWQ Quantized Models

```bash
# convert a 4-bit AWQ model to the Enflame format
python3 transform_safetensors.py --src /path/to/awq \
    --dst /path/to/awq-enflame --bits 4 --method awq --group 64 --nk True

# run the converted model
target/release/candle-vllm --dtype f16 --p 2000 --w /path/to/awq-enflame
```
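Before launching a converted model, it can be worth sanity-checking the output of `transform_safetensors.py`. The sketch below only uses the generic `safetensors` Python API to list tensor names and shapes without loading them; the file name is a placeholder, and no particular tensor layout from the converter is assumed.

```python
# Inspect a converted checkpoint without materializing the tensors.
# "model.safetensors" is a placeholder; converted folders may contain several shards.
from safetensors import safe_open

with safe_open("/path/to/awq-enflame/model.safetensors", framework="numpy") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())
```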
### Multi-Process, Multi-GPU

```bash
# Use card 0 and card 1
target/release/candle-vllm --p 2000 --d 0,1 --weight-path /path/to/model
```
### Multi-Node (MPI) Setup

```bash
# Install MPI
sudo apt install libopenmpi-dev openmpi-bin clang libclang-dev -y

# Build
cargo build --release --features gcu,eccl,mpi

# Launch via mpirun (make sure the model weights and the candle-vllm binary
# are located at the same paths on every machine)
sudo mpirun -np 16 -x RUST_LOG=info -hostfile ./hostfile \
  --allow-run-as-root -bind-to none -map-by slot \
  --mca btl_tcp_if_include %NET_INTERFACE% \
  target/release/candle-vllm --dtype bf16 --p 2000 \
  --d 0,1,2,3,4,5,6,7 --w /data/deepseek-enflame
```
### Chat Frontends

```bash
pip install openai rich click
python3 examples/chat.py
python3 examples/chat.py --live  # with markdown support
```
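`examples/chat.py` talks to the server through the OpenAI-compatible API; the snippet below is a reduced sketch of the same idea with streaming output, not a copy of the script. The host, port, and model name `"default"` are placeholders.

```python
# Streaming chat sketch against the OpenAI-compatible server (placeholder host/model).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="default",  # placeholder; candle-vllm serves the model it was launched with
    messages=[{"role": "user", "content": "Explain paged attention in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```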
```bash
# install Rust aichat
cargo install aichat
aichat --serve
# select `openai-compatible` and provide the name `candle-vllm`
# paste the candle-vllm API base url, e.g. http://0.0.0.0:2000/v1/ (API Key: empty, LLMs to include: default)
# click the "LLM Playground" url
```

Demo video: DeepSeek-Distill-LLaMa8B-BF16.mp4
Run batched benchmark tests:

```bash
python3 examples/benchmark.py --batch 16 --max_tokens 1024
```

Refer to the `benchmark.py` script for an async chat example.
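For a rough idea of what an async, batched client looks like (the authoritative logic lives in `examples/benchmark.py`), here is a hedged sketch using `AsyncOpenAI` and `asyncio.gather`; the batch size, prompts, and model name are illustrative only.

```python
# Async batched-request sketch; examples/benchmark.py is the authoritative version.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:2000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.chat.completions.create(
        model="default",  # placeholder model name
        messages=[{"role": "user", "content": f"Write a haiku about request {i}."}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main(batch: int = 16) -> None:
    # Fire the whole batch concurrently; continuous batching on the server
    # interleaves the decode steps.
    outputs = await asyncio.gather(*(one_request(i) for i in range(batch)))
    for out in outputs:
        print(out, "\n---")

asyncio.run(main())
```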
- Use `transform_safetensors.py` to convert models.
- Samples:
```bash
# 8-bit GPTQ conversion
python3 transform_safetensors.py --src /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit --dst /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame --bits 8 --method gptq --group 128 --nk True

# 4-bit AWQ conversion
python3 transform_safetensors.py --src /data/DeepSeek-R1-AWQ --dst /data/DeepSeek-R1-AWQ-Enflame/ --bits 4 --method awq --group 64 --nk True

# run the converted model
cargo run --release --features gcu -- --p 2000 \
  --w /data/Meta-Llama-3.1-8B-Instruct-GPTQ-8bit-Enflame
```

- Add GGUF model support (e.g., `q4_k` quantization).
- Extend support to multimodal models.