🚀 The feature, motivation and pitch
PR #3476 added support for loading models with Tensorizer, but it does not support loading a sharded vllm-serialized model onto multiple GPUs (see this verification check). Sharded models would also benefit from the faster loading and encryption that Tensorizer provides.
This open issue on Tensorizer suggests a couple of approaches to supporting sharding. With tensor-parallel models, the model is split across the GPUs, and the suggestion is to serialize each shard separately.
I have prototyped this approach of splitting the vllm-tensorized model into per-rank shards and am working on a PR; a sketch of the idea follows.
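
For concreteness, here is a minimal sketch of per-shard serialization. It assumes each tensor-parallel rank already holds its own `nn.Module` shard and that `torch.distributed` is initialized; the rank-templated file name is illustrative, not a fixed convention, and this is not the exact code in the PR.

```python
import torch.distributed as dist
from tensorizer import TensorSerializer


def serialize_shard(model, path_template: str = "model-rank-{rank:03d}.tensors"):
    # Each tensor-parallel rank writes only the tensors it holds,
    # so deserialization can later stream each file straight to its GPU.
    rank = dist.get_rank()
    path = path_template.format(rank=rank)  # illustrative naming scheme
    serializer = TensorSerializer(path)
    serializer.write_module(model)  # serializes this rank's shard only
    serializer.close()
```

One file per rank keeps serialization coupled to the tensor-parallel size used at save time, which is the trade-off the alternative below tries to avoid.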
Alternatives
The alternative given in the Tensorizer issue is to deserialize tensors to CPU memory and then send them to the GPUs. This would decouple the serialization of the model from the sharding configuration, but would also be less efficient; a sketch of this approach is below.
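
A minimal sketch of this alternative, assuming a single unsharded Tensorizer file and an initialized `torch.distributed` process group. `shard_for_rank` is a hypothetical helper that would slice each full tensor according to the tensor-parallel layout; it is not part of Tensorizer or vLLM.

```python
import torch
import torch.distributed as dist
from tensorizer import TensorDeserializer


def load_via_cpu(path: str):
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank}")
    # lazy_load keeps tensors unmaterialized until accessed, limiting
    # peak CPU memory while iterating over the full model.
    deserializer = TensorDeserializer(path, device="cpu", lazy_load=True)
    state_dict = {
        # shard_for_rank is hypothetical: it would extract this rank's
        # slice of the full tensor before the copy to GPU memory.
        name: shard_for_rank(tensor, name, rank).to(device)
        for name, tensor in deserializer.items()
    }
    deserializer.close()
    return state_dict
```

The extra CPU round trip and per-tensor slicing is what makes this path slower than streaming pre-sharded files directly to each GPU.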
Additional context
No response