Motivation.
OpenVINO is an open-source solution for inference of deep learning models, including LLMs. OpenVINO supports Intel and ARM CPUs, Intel integrated and discrete GPUs, and NPUs, and has a good reputation as a production-ready solution for client and server scenarios. The idea is to create an OpenVINO backend for vLLM that initially supports x86 CPUs as the primary device; other devices can be enabled later.
Thanks to the Optimum Intel HuggingFace extension (https://github.com/huggingface/optimum-intel), the OpenVINO vLLM backend can support a wide range of models, including those listed at https://docs.vllm.ai/en/stable/models/supported_models.html
OpenVINO provides better performance than the current vLLM CPU implementation, which will be demonstrated in the integration PR. In addition, the OpenVINO implementation of the PagedAttention operation supports modern vLLM features such as chunked prefill and prefix caching.
Proposed Change.
Introduce an OpenVINO vLLM backend, which:
- Loads the model via the optimum-intel extension for HuggingFace (https://github.com/huggingface/optimum-intel), as shown in the first sketch after this list
- (Optional step) Compresses model weights to a low-bit format
- Automatically converts the PyTorch model to the OpenVINO IR representation, which contains the PagedAttention operation
- Provides custom implementations of the model loader, model runner, and cache manager to hide OpenVINO API details (see the second sketch after this list)
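
As a rough illustration of the first three steps, here is a minimal sketch using the public optimum-intel API. It assumes optimum-intel is installed with OpenVINO support (`pip install optimum[openvino]`) and uses a placeholder model ID; the PagedAttention-specific IR conversion mentioned above happens inside the vLLM backend and is not part of this standalone snippet.

```python
# Minimal sketch: load a HuggingFace model through optimum-intel with
# on-the-fly OpenVINO IR conversion and optional weight compression.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID

# export=True converts the PyTorch model to OpenVINO IR on the fly;
# load_in_8bit=True applies optional low-bit weight compression.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```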
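
And a hypothetical sketch of how the backend components could wrap the OpenVINO runtime behind vLLM-style abstractions; all class and method names below are illustrative assumptions, not the actual integration code.

```python
# Hypothetical sketch: hiding OpenVINO API details behind vLLM-style
# components. Names and shapes are illustrative assumptions.
import openvino as ov

class OpenVINOCacheManager:
    """Allocates KV-cache blocks as OpenVINO tensors (illustrative)."""
    def __init__(self, num_blocks: int, block_shape: tuple):
        self.key_cache = ov.Tensor(ov.Type.f32, (num_blocks, *block_shape))
        self.value_cache = ov.Tensor(ov.Type.f32, (num_blocks, *block_shape))

class OpenVINOModelRunner:
    """Compiles the exported IR and executes one scheduler step (illustrative)."""
    def __init__(self, model_path: str, device: str = "CPU"):
        core = ov.Core()
        self.compiled = core.compile_model(model_path, device)
        self.request = self.compiled.create_infer_request()

    def execute(self, inputs: dict):
        # Forward pass; the PagedAttention op inside the IR reads and
        # writes the KV cache managed above.
        return self.request.infer(inputs)
```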
Feedback Period.
No response
CC List.
@WoosukKwon @zhuohan123 @Yard1
Any Other Things.
A wide range of OpenVINO customers are awaiting integration of the OpenVINO vLLM backend into the upstream vLLM repository.