Add support for SmallThinker model series #14898
base: master
Conversation
@@ -938,6 +938,100 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
    return moe_out;
}

ggml_tensor * llm_graph_context::build_moe_ffn_from_probs(
The code duplication is unfortunate, is it possible to merge this into `build_moe_ffn` with `probs` as a toggle without making too much of a mess?

Can be a follow-up.
That's a great point. I've been thinking about the best way to merge these and have a couple of ideas on how we could approach it.

- As you suggested, we could modify `build_moe_ffn` to accept an optional `probs` parameter. The main difficulty here is that the logic for weight normalization and activation functions diverges significantly between the two paths, so it would require some careful internal branching to keep it clean.
- Alternatively, we could extract the initial router logic (logits and probs calculation) into its own function. `build_moe_ffn` would then have a check at the beginning to decide whether to call this new router function. My main concern with this approach is that `build_moe_ffn` is a core function, and I'm a bit worried about affecting other models, so this would need careful testing.

Both approaches seem feasible. Given the complexity and your suggestion that this can be a follow-up, would you prefer I handle this in a separate PR, or should I proceed with one of these solutions here?
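For reference, a rough sketch of what the first option could look like. This is heavily simplified: the parameter list, helper calls, and omitted body below are placeholders for illustration, not the actual llama.cpp signatures.

```cpp
// Illustrative sketch only -- simplified pseudo-signature, not the real API.
// Idea: when the caller already has router probabilities (SmallThinker's
// pre-attention router), it passes them in; otherwise they are computed
// from the gate input as in the existing path.
ggml_tensor * llm_graph_context::build_moe_ffn(
        ggml_tensor * cur,
        ggml_tensor * gate_inp,
        ggml_tensor * probs,   // new, optional: pre-computed router probs
        /* ... expert tensors, n_expert, n_expert_used, gating/activation options ... */
        int il) {
    if (probs == nullptr) {
        // existing path: router logits -> softmax -> probs
        ggml_tensor * logits = build_lora_mm(gate_inp, cur); // placeholder helper
        probs = ggml_soft_max(ctx0, logits);
    }

    // expert selection and weighting would continue as before, with internal
    // branches where weight normalization / activation differ between the paths
    ggml_tensor * moe_out = nullptr; // built by the omitted expert-mixing code
    // ...
    return moe_out;
}
```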
A separate PR is probably best.
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Purpose
SmallThinker is a family of on-device, natively Mixture-of-Experts (MoE) language models designed specifically for local deployment, co-developed by IPADS (the team behind the high-speed inference framework PowerInfer) and the School of AI at Shanghai Jiao Tong University, together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to your personal devices, without relying on the cloud.
This PR adds support for the SmallThinker series of models to llama.cpp.
Modifications
- Added a new graph-building function, `build_moe_ffn_from_probs`, to handle SmallThinker's unique architecture, where the MoE router is positioned before the attention block.
- Added `set_dense_start_swa_pattern` (see the sketch below). While the existing `set_swa_pattern` function enables a pattern where every Nth layer is dense, starting the count from SWA layers, the new function allows the pattern to start with a dense layer.
Testing
Clone the model from https://huggingface.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and use `convert_hf_to_gguf.py` to convert it to GGUF format.