
DeepSeek V3 Support #35425

@casper-hansen

Description


Model description

Transformer model

DeepSeek V3 is a Transformer model that uses a Mixture-of-Experts architecture (similar to Qwen2 MoE) together with Multi-head Latent Attention (MLA). A rough sketch of the MoE block is shown after the figure below.

(Figure: DeepSeek-V3 architecture diagram.)
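To make the MoE part concrete, here is a minimal, self-contained PyTorch sketch of a DeepSeek-style MoE block: a router picks the top-k routed experts per token, and their weighted outputs are added to an always-on shared expert. All names and sizes (`SimpleMoEBlock`, `hidden_size=64`, the number of experts, etc.) are illustrative assumptions for this issue, not the official modeling code.

```python
import torch
import torch.nn as nn

class SimpleMoEBlock(nn.Module):
    """Illustrative DeepSeek-style MoE layer: top-k routed experts plus a shared expert."""

    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden_size, intermediate_size),
                    nn.SiLU(),
                    nn.Linear(intermediate_size, hidden_size),
                )
                for _ in range(num_experts)
            ]
        )
        # DeepSeekMoE also keeps shared expert(s) that process every token.
        self.shared_expert = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, hidden_size),
        )

    def forward(self, hidden_states):  # (batch, seq, hidden)
        scores = self.router(hidden_states).softmax(dim=-1)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        routed = torch.zeros_like(hidden_states)
        # Dense loop over experts for readability; a real implementation
        # gathers tokens per expert instead of running every expert on every token.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e).unsqueeze(-1)
                routed = routed + mask * top_scores[..., k : k + 1] * expert(hidden_states)
        return routed + self.shared_expert(hidden_states)


x = torch.randn(1, 4, 64)
print(SimpleMoEBlock()(x).shape)  # torch.Size([1, 4, 64])
```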

Multi-token Prediction

The model can predict multiple tokens sequentially at each step through its MTP modules. The first token is produced by the causal LM, and its output is then fed into what I would describe as a "Transformer head" that generates additional tokens for the current step (a rough sketch follows the figure below). DeepSeek notes in their release that "MTP support is currently under active development within the community, and we welcome your contributions and feedback" (i.e. the code for this is not released).

(Figure: multi-token prediction with MTP modules.)
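As a rough illustration of the description above (again, the official MTP code is not released), the sketch below treats each MTP module as a small Transformer block plus an LM head that consumes the main model's hidden states together with the embedding of the token just predicted. Everything here (`TinyMTPHead`, the sizes, the single decode step) is a hypothetical assumption, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class TinyMTPHead(nn.Module):
    """Hypothetical MTP module: a small Transformer block plus an LM head
    that predicts one additional token for the current decoding step."""

    def __init__(self, hidden_size=64, vocab_size=1000, num_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states):  # (batch, seq, hidden)
        return self.lm_head(self.block(hidden_states))


# Hypothetical single decode step: the main causal LM has produced one token,
# and the MTP head consumes the hidden states plus that token's embedding
# to predict one more token for the same step.
hidden = torch.randn(1, 8, 64)            # hidden states from the main model
emb = nn.Embedding(1000, 64)
prev_token = torch.tensor([[42]])          # token just produced by the causal LM
mtp_input = torch.cat([hidden, emb(prev_token)], dim=1)
extra_logits = TinyMTPHead()(mtp_input)[:, -1]
print(extra_logits.shape)  # torch.Size([1, 1000])
```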

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

Transformers Code: https://huggingface.co/deepseek-ai/DeepSeek-V3
GitHub Code (minimal implementation): https://github.com/deepseek-ai/DeepSeek-V3/tree/main/inference
Paper: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
