Description
Anything you want to discuss about vllm.
Progress
- CMake and build system for Intel XPU/SYCL
- vLLM custom op implementation in SYCL source code
- Integrate the Intel XPU backend for basic model inference
- Support tensor parallelism with Ray for the XPU backend
- Integrate IPEX-optimized kernels (e.g., paged attention) for better performance
- Quantization support
Target Intel GPU devices and models
For Intel GPU devices (named xpu in the PyTorch context), we aim to make vLLM natively support Intel Xe architecture graphics cards, including data center Max GPUs (such as PVC 1550 and PVC 1100) and client GPUs (such as Arc A770).
For models, we will make sure vLLM + xpu works well with all models that vLLM already supports.
Design
Python API
Since Intel GPUs have an API (via IPEX) and behavior similar to CUDA devices, we introduce just two new classes:
- XPUExecutor (extends ExecutorBase) behaves similarly to GpuExecutor. LLMEngine and AsyncLLMEngine dispatch to this executor class based on the device type.
- XPUWorker (extends the Worker class) initializes the environment. Most of its code is shared with the parent class.
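The device-based dispatch described above can be pictured with a minimal sketch. This is not vLLM's actual code: the classes below are simplified stand-ins that only borrow the names from this RFC, and the registry `_EXECUTORS`, the helper `select_executor_class`, and the method `device_name` are hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod


class ExecutorBase(ABC):
    """Simplified stand-in; the real ExecutorBase has a larger interface."""

    @abstractmethod
    def device_name(self) -> str:
        ...


class GpuExecutor(ExecutorBase):
    def device_name(self) -> str:
        return "cuda"


class XPUExecutor(ExecutorBase):
    def device_name(self) -> str:
        return "xpu"


# Hypothetical registry mapping a device type string to an executor class.
_EXECUTORS = {"cuda": GpuExecutor, "xpu": XPUExecutor}


def select_executor_class(device_type: str) -> type:
    """Return the executor class registered for the given device type."""
    try:
        return _EXECUTORS[device_type]
    except KeyError:
        raise ValueError(f"unsupported device type: {device_type!r}") from None


executor = select_executor_class("xpu")()
print(type(executor).__name__, executor.device_name())
```

Keeping the choice in one place means the engine code that drives the executor does not need to know which device it is running on.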
Torch API
In addition, we introduce a torch_sdpa backend (reusing torch scaled_dot_product_attention from the CPU backend support) to compute attention over prompt tokens, since xformers and flash_attn are not supported on Intel GPUs.
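For readers unfamiliar with what scaled dot-product attention computes for prompt tokens, here is a pure-Python reference of the math for a single head with a causal mask. It is only an illustration: the torch_sdpa backend would call PyTorch's `torch.nn.functional.scaled_dot_product_attention` rather than anything like this loop, and the function name `causal_sdpa` is made up for the sketch.

```python
import math


def causal_sdpa(q, k, v):
    """Causal scaled dot-product attention for a single head.

    q, k, v are lists of equal length; each entry is a list of floats
    (one vector per prompt token). Token i attends only to tokens 0..i.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        # Scores against the current and earlier tokens only (causal mask).
        scores = [
            scale * sum(a * b for a, b in zip(q_i, k[j]))
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


# Two prompt tokens with zero queries: attention weights are uniform
# over the visible positions.
result = causal_sdpa([[0.0, 0.0], [0.0, 0.0]],
                     [[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 2.0], [3.0, 4.0]])
print(result)
```

The first token sees only itself, so its output equals its own value vector; the second token averages both value vectors because its scores are equal.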
Custom Op
vLLM implements many efficient CUDA kernels and packages them as the _C library via pybind. These kernels are ported to SYCL with the same function signatures, so they can replace the CUDA kernels directly. The SYCL custom kernel build procedure is integrated into the vLLM CMake build system.
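The "same function signatures" idea can be sketched in Python. Everything here is an illustrative assumption: the two functions are pure-Python stand-ins rather than the real compiled kernels, and `bind_ops` is a hypothetical helper, not part of vLLM. The point is only that a port with an identical signature can be swapped in without touching callers.

```python
import inspect
import math


def silu_and_mul_cuda(out, x):
    """Stand-in for a CUDA-backed kernel: out[i] = silu(x[i]) * x[i + d]."""
    d = len(x) // 2
    for i in range(d):
        out[i] = x[i] / (1.0 + math.exp(-x[i])) * x[i + d]


def silu_and_mul_sycl(out, x):
    """Stand-in for the SYCL port: same signature, same result."""
    d = len(x) // 2
    for i in range(d):
        out[i] = x[i] / (1.0 + math.exp(-x[i])) * x[i + d]


def bind_ops(device_type):
    """Pick the kernel implementation once; callers use it unchanged."""
    impl = silu_and_mul_sycl if device_type == "xpu" else silu_and_mul_cuda
    # A drop-in replacement must accept exactly the same arguments.
    assert inspect.signature(impl) == inspect.signature(silu_and_mul_cuda)
    return impl


silu_and_mul = bind_ops("xpu")
buf = [0.0]
silu_and_mul(buf, [1.0, 2.0])
print(buf)
```

Because the caller only ever sees `silu_and_mul(out, x)`, the rest of the Python code stays device-agnostic.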
Background & References
Intel Max series GPU: https://www.intel.com/content/www/us/en/products/docs/processors/max-series/overview.html
You can try to get Intel GPU access via Intel Developer Cloud.
Intel Extension for PyTorch: https://github.com/intel/intel-extension-for-pytorch/tree/xpu-main