llama : (mrope) allow using normal 1D position for text token #13138
Conversation
src/llama-graph.cpp (outdated)

  ggml_tensor * llm_graph_context::build_inp_attn_scale() const {
-     auto inp = std::make_unique<llm_graph_input_attn_temp>(n_pos_per_token(), hparams.n_attn_temp_floor_scale, hparams.f_attn_temp_scale);
+     auto inp = std::make_unique<llm_graph_input_attn_temp>(n_pos_per_embd(), hparams.n_attn_temp_floor_scale, hparams.f_attn_temp_scale);
@ggerganov Because build_inp_attn_scale is currently used exclusively by llama 4, do you think we should get rid of n_pos_per_embd and replace it with a GGML_ASSERT(n_pos_per_embd() == 1)?
The main motivation is to make this code look less complicated, as there is ~0% chance a Qwen model is going to use this.
Yes, we can do that.
On second thought, build_inp_attn_scale should work well even in the case of N positions per token.
That's because the scale is applied per embedding, and the number of embeddings is independent of the number of positions per token.
In any case, I removed n_pos_per_embd in 9cd16a3; merging this PR once the CI is green.
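To make the per-embedding argument concrete, here is a small standalone illustration (not code from this PR, and the scale formula below is a placeholder, not the exact llama 4 one): the attention-temperature scale has one value per token embedding, so its size depends only on the number of tokens, while the position buffer is what grows with the number of position components per embedding (1 for plain text, 4 for M-RoPE's t/h/w/e sections).

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_tokens       = 4;
    const int n_pos_per_embd = 4; // M-RoPE-style layout, purely illustrative

    std::vector<int>   pos(n_pos_per_embd * n_tokens); // positions grow with n_pos_per_embd
    std::vector<float> attn_scale(n_tokens);           // one scale value per token embedding

    for (int i = 0; i < n_tokens; ++i) {
        const int p = i; // 1D position of token i
        // placeholder temperature-style scale, not the actual llama 4 formula
        attn_scale[i] = 1.0f + 0.1f * std::log1p((float) (p / 8192));
    }

    // the scale buffer is independent of n_pos_per_embd
    assert(attn_scale.size() == (size_t) n_tokens);
    assert(pos.size()        == (size_t) (n_pos_per_embd * n_tokens));

    std::printf("pos: %zu entries, attn_scale: %zu entries\n", pos.size(), attn_scale.size());
    return 0;
}
```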
PR description

For M-RoPE, we want to use a normal 1D position for text tokens.

This is done to simplify the use case of llama_decode() with text tokens, which is needed for adding Qwen2VL to libmtmd and to server.cpp.

This should also align with #11875, because in the future we want the text position to be tracked internally by libllama.
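For illustration, a minimal sketch of what this enables on the caller side, using the public llama_batch API (the decode_text helper and its error handling are hypothetical, not part of this PR; a loaded model and context are assumed): text-only batches can be submitted with plain 1D positions even for an M-RoPE model such as Qwen2VL, without the caller building the 4-component position layout.

```cpp
#include "llama.h"

#include <vector>

// Hypothetical helper: decode a batch of text tokens with plain 1D positions.
static int decode_text(llama_context * ctx, const std::vector<llama_token> & tokens) {
    llama_batch batch = llama_batch_init((int32_t) tokens.size(), /*embd*/ 0, /*n_seq_max*/ 1);

    for (size_t i = 0; i < tokens.size(); ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = (llama_pos) i;              // normal 1D position
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = (i == tokens.size() - 1);   // logits only for the last token
    }
    batch.n_tokens = (int32_t) tokens.size();

    const int ret = llama_decode(ctx, batch);
    llama_batch_free(batch);
    return ret;
}
```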