Model description
Transformer model
DeepSeek V3 is a Transformer model that uses a Mixture of Experts (MoE) architecture (similar to Qwen2 MoE) together with Multi-head Latent Attention (MLA).
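For reference, a minimal sketch of how the released checkpoint could be loaded through the custom modeling code shipped in the Hub repo until native support lands (dtype/device settings here are assumptions, not a recommendation):

```python
# Illustrative sketch: load DeepSeek V3 via the repo's remote code.
# Exact dtype/device choices are assumptions for this example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V3"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,      # uses the repo's modeling code until transformers ships native support
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("DeepSeek V3 is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```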
Multi-token Prediction
The model can predict multiple tokens sequentially at each step through its Multi-Token Prediction (MTP) modules. The first token is produced by the main causal LM; its output is then fed into what I would describe as an additional "Transformer head" (the MTP module) to generate further tokens for the current step. DeepSeek notes in their release that "MTP support is currently under active development within the community, and we welcome your contributions and feedback" (i.e. the MTP inference code has not been released).
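To make the MTP idea concrete, here is a small, self-contained sketch (not DeepSeek's code; the module names, fusion scheme, and shapes are assumptions) of how an extra "Transformer head" could take the backbone's hidden states plus the freshly generated token and propose the next token within the same step:

```python
# Illustrative MTP-style head: names and shapes are assumptions, not DeepSeek's implementation.
import torch
import torch.nn as nn

class MTPHeadSketch(nn.Module):
    """Combine the backbone's last hidden state with the embedding of the token
    just generated, run one extra Transformer layer, and predict the next token."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.proj = nn.Linear(2 * hidden_size, hidden_size)  # fuse hidden state + new-token embedding
        self.block = nn.TransformerEncoderLayer(hidden_size, num_heads, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor, new_token: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden) from the main model; new_token: (batch,) ids from step t
        tok = self.embed(new_token).unsqueeze(1)                        # (batch, 1, hidden)
        fused = self.proj(torch.cat([hidden[:, -1:, :], tok], dim=-1))  # (batch, 1, hidden)
        h = self.block(torch.cat([hidden, fused], dim=1))               # extend sequence by one position
        return self.lm_head(h[:, -1, :])                                # logits for the token at step t+1

# Usage sketch: after the causal LM emits token t, feed its hidden states and the
# token id into the MTP head to get a candidate for token t+1 in the same step.
head = MTPHeadSketch(hidden_size=64, vocab_size=1000)
logits = head(torch.randn(2, 5, 64), torch.randint(0, 1000, (2,)))
print(logits.shape)  # torch.Size([2, 1000])
```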
Open source status
- The model implementation is available
- The model weights are available
Provide useful links for the implementation
Hugging Face model repo (weights + custom modeling code): https://huggingface.co/deepseek-ai/DeepSeek-V3
GitHub Code (minimal implementation): https://github.com/deepseek-ai/DeepSeek-V3/tree/main/inference
Paper: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf

