Vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants #14903
base: master
Conversation
Here are performance results from my tests:
const uint b_block_idx = (j*p.batch_stride_b + col) / QUANT_K_Q8_1 + b_offset;
cache_b_ds = vec2(data_b[b_block_idx].ds);
[[unroll]] for (uint k = 0; k < 8; k++) {
    cache_b_qs[k] = data_b[b_block_idx].qs[k];
}
You need a barrier after these shared memory stores, and either after the loads or before the stores for the next iteration.
Seems like you can cut down the loads by having the first 8 threads each do one of the iterations. And ds could just go straight to registers rather than the extra copy through shared memory.
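For illustration, here is a minimal sketch of the staging pattern being suggested, under the assumption that cache_b_qs is a workgroup-shared array (as the next reply points out, in the actual shader it is not). data_b is the existing B buffer from the snippet above; the helper name load_b_block is hypothetical.

```glsl
// Sketch only: assumes cache_b_qs were declared as workgroup-shared storage.
shared int cache_b_qs[8];

void load_b_block(const uint b_block_idx) {
    // Let the first 8 invocations each fetch one packed qs word
    // instead of every thread loading all 8.
    if (gl_LocalInvocationID.x < 8) {
        cache_b_qs[gl_LocalInvocationID.x] = data_b[b_block_idx].qs[gl_LocalInvocationID.x];
    }
    barrier(); // make the stores visible before any invocation reads them

    // ds can stay in registers; no round-trip through shared memory needed.
    const vec2 ds = vec2(data_b[b_block_idx].ds);

    // ... accumulate dot products using cache_b_qs and ds ...

    barrier(); // needed before the next iteration overwrites cache_b_qs
}
```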
That's not shared memory
Oops. Maybe it's worth loading the qs values through shared memory? If the issue is with too many small loads like you suggested, then copying through shared memory ought to help.
Actually, I guess I can't tell if the b_block_idx value is shared between threads. So maybe this idea doesn't work.
Another idea might be to add padding to the q8_1 struct so you can do 16B loads rather than 4B loads.
Yeah, that might be worth it. I know the CUDA backend stacks 4 q8_1 blocks in a struct for that reason.
I did a quick before/after on some Q4_0 models, and it looks like the quantization is pretty expensive:
I don't think there's anything particularly wrong with how the quantization is implemented; it's such a small amount of work that it doesn't fill the GPU, and the 5090 is just about the worst case for that. I don't have any great suggestions for what to do about this.
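For illustration, a rough sketch of what stacking four q8_1 blocks behind a 16-byte-aligned header could look like on the GLSL side. The struct name and exact member layout are assumptions here, not code from this PR or from the CUDA backend:

```glsl
// Sketch: a plain q8_1 block is 4 bytes of (d, s) plus 32 bytes of quants = 36 bytes,
// so neither the block nor its qs member is 16-byte aligned and reads end up as 4B loads.
// Grouping 4 blocks amortizes the header and lets qs be fetched as 16B (ivec4) vectors.
// Sizes assume tight (scalar-layout-style) packing; f16vec2 needs 16-bit storage support.
struct block_q8_1_x4 {
    f16vec2 ds[4];  // (d, s) for each of the 4 sub-blocks: 16 bytes
    ivec4   qs[8];  // 4 * 32 int8 quants packed 4-per-int: 128 bytes, 16B-aligned
};
```

Whether the wider loads pay for themselves would still need to be measured against the quantization overhead noted above, since the quantize shader has to produce this interleaved layout.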
Here's an initial version of an Integer Dot mul_mat_vec shader. So far it seems to improve performance with q4_1 and q5_1, but reduce it with q4_0, q5_0 and q8_0. My guess is that this is because of the 32-bit loads in q4_1 and q5_1, while the rest use 16-bit loads.
@jeffbolznv Would you mind taking a look and letting me know if I have any obvious performance issues in the shader?
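For context, the core of an integer-dot path for something like q8_0 against the quantized q8_1 activations looks roughly like the sketch below. It assumes GL_EXT_integer_dot_product (dotPacked4x8AccSatEXT) and that both operands are already packed four int8 values per 32-bit word, as in the cache_b_qs snippet above; it is a simplified illustration, not the shader from this PR.

```glsl
#extension GL_EXT_integer_dot_product : require
#extension GL_EXT_control_flow_attributes : enable

// One q8_0 block (scale da, quants qa) against one q8_1 block (quants qb, dsb = (d, s)).
// Each dotPacked4x8AccSatEXT call does four int8*int8 multiplies plus the accumulate.
float mul_q8_0_q8_1(const int qa[8], const float da, const int qb[8], const vec2 dsb) {
    int acc = 0;
    [[unroll]] for (uint k = 0; k < 8; k++) {
        acc = dotPacked4x8AccSatEXT(qa[k], qb[k], acc);
    }
    // q8_0 has no offset term, so only the product of the two scales is needed;
    // the q8_1 per-block sum (dsb.y) only comes into play for the *_1 quants with a min.
    return da * dsb.x * float(acc);
}
```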