🚀 The feature, motivation and pitch
I am trying to run a 70B model on a node with 3×A100-80GB GPUs.
Two A100-80GB GPUs do not provide enough VRAM to hold the model, and when I try to run vLLM with a tensor parallel size of 3, it raises an error saying that the number of attention heads is not divisible by 3.
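For concreteness, assuming a Llama-2-70B-style config (64 query heads and 8 KV heads under GQA), neither head count is divisible by 3, so a check along these lines fails (a minimal illustration of the constraint, not vLLM's actual code):

```python
# Illustration of the even-split constraint, assuming a
# Llama-2-70B-style config: 64 query heads, 8 KV heads (GQA).
num_attention_heads = 64
num_kv_heads = 8
tensor_parallel_size = 3

# Each rank must own a whole number of heads, so both head counts
# must divide evenly by the tensor parallel size.
assert num_attention_heads % tensor_parallel_size == 0, (
    f"{num_attention_heads} heads cannot be split evenly across "
    f"{tensor_parallel_size} GPUs "
    f"(remainder {num_attention_heads % tensor_parallel_size})"
)  # fails here: 64 % 3 = 1
```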
I looked into changing the tensor parallelism implementation so that it supports an uneven division of the tensors across GPUs, but I might be missing something, as there are many validations in the codebase that explicitly rule out this scenario.
Is it possible to implement tensor parallelism this way?
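To make the question concrete, here is a sketch of the kind of uneven split I have in mind. `uneven_head_partition` is a hypothetical helper, not an existing vLLM function; the real work would presumably be teaching the sharded linear layers and the all-reduce/all-gather collectives to handle per-rank shard sizes:

```python
def uneven_head_partition(num_heads: int, tp_size: int) -> list[int]:
    """Assign heads to ranks as evenly as possible (hypothetical helper).

    Earlier ranks absorb the remainder, e.g. 64 heads on 3 GPUs -> [22, 21, 21].
    """
    base, remainder = divmod(num_heads, tp_size)
    return [base + (1 if rank < remainder else 0) for rank in range(tp_size)]

print(uneven_head_partition(64, 3))  # [22, 21, 21]
```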
Alternatives
No response
Additional context
No response