
[Feature]: Support loading of sharded vLLM serialized models with Tensorizer #4957

@tjohnson31415

Description


🚀 The feature, motivation and pitch

PR #3476 added support for loading models with Tensorizer, but it has the limitation that it cannot load a sharded vllm-serialized model onto multiple GPUs (see this verification check). Sharded models would also benefit from the faster loading and encryption that Tensorizer provides.

This open issue with Tensorizer suggests a couple of approaches to supporting sharding. With tensor-parallel models, the model is split across the GPUs, and the suggestion is to serialize each shard separately.

I have prototyped this approach of splitting the vllm-tensorized model into multiple shards and am working on a PR.
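For illustration, here is a minimal sketch of the per-shard idea using Tensorizer's public `TensorSerializer` API. The rank-suffixed filename pattern is an assumption for this example, not an established vLLM convention:

```python
import torch.distributed as dist
from tensorizer import TensorSerializer

def serialize_tp_shard(model, base_uri: str) -> None:
    """Serialize only this tensor-parallel rank's shard of the model.

    Each rank already holds just its partition of the weights, so
    write_module() naturally captures one shard per rank. The
    rank-suffixed filename is illustrative, not a vLLM convention.
    """
    rank = dist.get_rank()
    serializer = TensorSerializer(f"{base_uri}/model-rank-{rank:03d}.tensors")
    serializer.write_module(model)
    serializer.close()
```

Deserialization would then require the same tensor-parallel world size that was used at serialization time, which is exactly the coupling the alternative below avoids.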

Alternatives

The alternative given in the Tensorizer issue is to deserialize the tensors to CPU memory and then send them to the GPUs. This would decouple the serialization of the model from the sharding configuration, but it would also be less efficient.
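A hedged sketch of that alternative follows. `shard_dim0` is a hypothetical stand-in for vLLM's real per-layer sharding logic (column- vs. row-parallel layers split on different dimensions), used here only to keep the example self-contained:

```python
import torch
import torch.distributed as dist
from tensorizer import TensorDeserializer

def shard_dim0(tensor: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Illustrative stand-in: real tensor-parallel sharding is per-layer
    (column- vs. row-parallel), not a uniform dim-0 split."""
    chunk = tensor.shape[0] // world_size
    return tensor[rank * chunk : (rank + 1) * chunk]

def load_via_cpu(uri: str, model: torch.nn.Module) -> None:
    """Deserialize the full model to CPU memory, then copy this rank's
    slice to its GPU. This works with any sharding configuration but
    pays for extra host memory and host-to-device copies."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{rank}")
    deserializer = TensorDeserializer(uri, device="cpu")
    state_dict = {
        name: shard_dim0(tensor, rank, world_size).to(device)
        for name, tensor in deserializer.items()
    }
    deserializer.close()
    model.load_state_dict(state_dict)
```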

Additional context

No response
