
DeepSeek V3 Support #35425

@casper-hansen

Description


Model description

Transformer model

DeepSeek V3 is a Transformer model that uses a Mixture-of-Experts architecture (similar to Qwen2 MoE) together with Multi-head Latent Attention (MLA). A rough sketch of the MoE block is shown after the figure below.

(Figure: DeepSeek-V3 architecture diagram.)
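To make the MoE part concrete, here is a minimal, self-contained PyTorch sketch of a DeepSeek-style MoE block: a router picks the top-k routed experts per token, and their weighted outputs are added to an always-on shared expert. All names and sizes (`SimpleMoEBlock`, `hidden_size=64`, the number of experts, etc.) are illustrative assumptions for this issue, not the official modeling code.

```python
import torch
import torch.nn as nn

class SimpleMoEBlock(nn.Module):
    """Illustrative DeepSeek-style MoE layer: top-k routed experts plus a shared expert."""

    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden_size, intermediate_size),
                    nn.SiLU(),
                    nn.Linear(intermediate_size, hidden_size),
                )
                for _ in range(num_experts)
            ]
        )
        # DeepSeekMoE also keeps shared expert(s) that process every token.
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, hidden_states):  # (batch, seq, hidden)
        scores = self.router(hidden_states).softmax(dim=-1)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(hidden_states)
        # Dense loop over experts for readability; a real implementation
        # gathers tokens per expert instead of running every expert on every token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e).unsqueeze(-1)
                routed = routed + mask * top_scores[..., k : k + 1] * expert(hidden_states)
        return routed + self.shared_expert(hidden_states)


x = torch.randn(1, 4, 64)
print(SimpleMoEBlock()(x).shape)  # torch.Size([1, 4, 64])
```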

Multi-token Prediction

The model can predict multiple tokens sequentially at each step through its MTP modules. The first token is produced by the causal LM, and its output is then fed into what I would describe as a "Transformer head" that generates additional tokens for the current step (a rough sketch follows the figure below). DeepSeek notes in their release that "MTP support is currently under active development within the community, and we welcome your contributions and feedback" (i.e. the code for this is not released).

(Figure: multi-token prediction with MTP modules.)
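As a rough illustration of the description above (again, the official MTP code is not released), the sketch below treats each MTP module as a small Transformer block plus an LM head that consumes the main model's hidden states together with the embedding of the token just predicted. Everything here (`TinyMTPHead`, the sizes, the single decode step) is a hypothetical assumption, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class TinyMTPHead(nn.Module):
    """Hypothetical MTP module: a small Transformer block plus an LM head
    that predicts one additional token for the current decoding step."""

    def __init__(self, hidden_size=64, vocab_size=1000, num_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states):  # (batch, seq, hidden)
        return self.lm_head(self.block(hidden_states))


# Hypothetical single decode step: the main causal LM has produced one token,
# and the MTP head consumes the hidden states plus that token's embedding
# to predict one more token for the same step.
hidden = torch.randn(1, 8, 64)            # hidden states from the main model
emb = nn.Embedding(1000, 64)
prev_token = torch.tensor([[42]])          # token just produced by the causal LM
mtp_input = torch.cat([hidden, emb(prev_token)], dim=1)
extra_logits = TinyMTPHead()(mtp_input)[:, -1]
print(extra_logits.shape)  # torch.Size([1, 1000])
```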

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Transformers Code: https://huggingface.co/deepseek-ai/DeepSeek-V3
GitHub Code (minimal implementation): https://github.com/deepseek-ai/DeepSeek-V3/tree/main/inference
Paper: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
