
[RFC]: OpenVINO vLLM backend #5377

@ilya-lavrenov

Description

Motivation.

OpenVINO is an open-source solution for inference of deep learning models, including LLMs. OpenVINO supports both Intel and ARM CPUs, Intel integrated and discrete GPUs, and NPUs, and has a good reputation as a production-ready solution for client and server scenarios. The idea is to create an OpenVINO backend for vLLM that will initially support x86 CPU as the primary device; other devices can be enabled later.

Thanks to the Optimum Intel HuggingFace extension https://github.com/huggingface/optimum-intel, the OpenVINO vLLM backend can support a wide range of models, including those listed at https://docs.vllm.ai/en/stable/models/supported_models.html

OpenVINO provides better performance than the current vLLM CPU implementation, which will be demonstrated in the integration PR. In addition, the OpenVINO implementation of the PagedAttention operation supports modern vLLM features such as chunked prefill and prefix caching.

Proposed Change.

Introduce an OpenVINO vLLM backend, which:

  • Loads the model via the optimum-intel extension for HuggingFace https://github.com/huggingface/optimum-intel
  • (Optional step) Compresses model weights to a low-bit format
  • Automatically converts the PyTorch model to the OpenVINO IR representation, which contains the PagedAttention operation
  • Provides custom implementations of the OpenVINO model loader, model runner, and cache manager to hide OpenVINO API details
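The first two steps above (loading via optimum-intel and optional weight compression) could look roughly like the following sketch. The model id and quantization settings are illustrative only and not part of this proposal; running it requires `pip install optimum[openvino]` and downloads the model from the Hub:

```python
# Sketch: load a HuggingFace model through optimum-intel, converting it to
# OpenVINO IR on the fly and (optionally) compressing weights to 4-bit.
# The model id below is a placeholder example, not part of the RFC.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model id

# Optional step: compress weights to a low-bit format during export.
quant_config = OVWeightQuantizationConfig(bits=4)

# export=True triggers the automatic PyTorch -> OpenVINO IR conversion.
model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=quant_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("OpenVINO is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Within the proposed backend these steps would be hidden behind the custom model loader rather than exposed to users directly.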

Feedback Period.

No response

CC List.

@WoosukKwon @zhuohan123 @Yard1

Any Other Things.

OpenVINO has a wide range of customers awaiting integration of the OpenVINO vLLM backend into the upstream vLLM repository.
