Feat/tensor override #1180
Conversation
yes please, that'll remove a lot of noise from this PR and make it easier to review.
Force-pushed from 3dd6110 to 8dc45de.
Okay, done. But now it's going to show a conflict when you merge the Apr_2025 branch, because they both "added" tensor_buft_overrides to LLamaModelParams. :P
CI is failing due to …
I've resolved the minor merge conflict. Just waiting on tests before I merge this.
```csharp
/// </summary>
public static class IModelParamsExtensions
{
    private static LLamaTensorBufferOverrideHelper bufferOverrideHelper = new();
```
This being static and shared between calls seems like a potential source of bugs. Can a new non-shared one be allocated when it's needed?
I'd like it to, but I was thinking it would require some much messier changes to how the ModelParams is converted to LLamaModelParams. Or the memory would have to be allocated in the LLamaModelParams in the first place for proper disposal, but that's a native struct. So we'd have to do something like wrap LLamaModelParams in another class that keeps the allocated null-terminated LLamaModelTensorBufferOverride array alive only until it can be properly disposed by the wrapper class. Again, I don't do much native stuff in C#, so maybe I'm overthinking it and you can just toss an array of structs into LLamaModelParams instead of an IntPtr...
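A rough sketch of that wrapper idea might look something like this. Every name here except LLamaModelParams is hypothetical, and this is only one possible shape, not the PR's actual approach:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch: a disposable owner that keeps the native,
// null-terminated override array alive exactly as long as the
// LLamaModelParams struct needs it.
internal sealed class LLamaModelParamsOwner : IDisposable
{
    public LLamaModelParams Params;   // the native struct handed to llama.cpp
    private IntPtr _overrides;        // unmanaged memory backing tensor_buft_overrides

    public LLamaModelParamsOwner(LLamaModelParams @params, IntPtr overrides)
    {
        Params = @params;
        _overrides = overrides;
    }

    public void Dispose()
    {
        // free the unmanaged override array exactly once
        if (_overrides != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_overrides);
            _overrides = IntPtr.Zero;
        }
    }
}
```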
At the moment you're doing `disposer.Add(bufferOverrideHelper);`. That means that the `bufferOverrideHelper` object will be disposed when necessary (when LLamaSharp has finished using the `LLamaModelParams` struct).

So (unless I'm misunderstanding something here) you've currently got a system where the memory you're allocating is deallocated by that dispose call (which seems reasonable!). But then you've got it as a static, which means the object managing the memory is shared.
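Concretely, the suggestion amounts to something like the sketch below: allocate a fresh (non-static) helper per conversion and register it with the disposer, so each call owns and eventually frees its own native memory. The method shape and the disposer type name are assumptions based on the snippets quoted in this thread:

```csharp
// Sketch only: one helper per call, no state shared between conversions.
public static class ModelParamsConversionSketch
{
    public static LLamaModelParams ToLlamaModelParams(this IModelParams @params, GroupDisposable disposer)
    {
        var result = new LLamaModelParams();

        // A fresh helper for this call, instead of a shared static field.
        var bufferOverrideHelper = new LLamaTensorBufferOverrideHelper();
        // ... populate result (including tensor_buft_overrides) via the helper ...

        // Disposed once LLamaSharp has finished with the struct.
        disposer.Add(bufferOverrideHelper);
        return result;
    }
}
```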
```csharp
/// <summary>
/// Helper for creating and managing tensor buffer overrides
/// </summary>
public class LLamaTensorBufferOverrideHelper : IDisposable
```
Can this be made `private`? It seems like it's only used internally and exposes quite a lot of new API surface if left public.
We can certainly make it `internal`, but not `private` unless you want to move it into `IModelParamsExtensions`, which I guess is fine (especially with partial classes being a thing now). I think I probably just left it public because `ToLlamaModelParams` is public.
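For illustration, `private` would mean nesting the helper inside the extensions class, roughly like this (a hypothetical shape, not the PR's code):

```csharp
// 'private' requires nesting inside the extensions class (workable via
// partial classes), whereas 'internal' keeps the helper top-level but
// invisible to consumers of the library.
public static partial class IModelParamsExtensions
{
    private sealed class LLamaTensorBufferOverrideHelper : IDisposable
    {
        public void Dispose()
        {
            // free the native override allocations here
        }
    }
}
```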
Derp, I meant `internal`, sorry.
That looks like a genuine test failure, due to the new properties in ModelParams.
Looks like it's just failing because Assert.Equals is doing a by-reference comparison. I'll modify the test to treat the new property the same way as TensorSplits and MetadataOverrides.
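In other words, the round-trip test needs to compare the collection's contents rather than its reference. A hedged sketch, assuming an xUnit test and illustrative property names on TensorBufferOverride (Pattern and BufferType are assumptions):

```csharp
// Compare element-wise, mirroring how TensorSplits / MetadataOverrides
// are treated, instead of relying on a by-reference equality check.
Assert.Equal(expected.TensorBufferOverrides.Count, actual.TensorBufferOverrides.Count);
for (var i = 0; i < expected.TensorBufferOverrides.Count; i++)
{
    Assert.Equal(expected.TensorBufferOverrides[i].Pattern, actual.TensorBufferOverrides[i].Pattern);
    Assert.Equal(expected.TensorBufferOverrides[i].BufferType, actual.TensorBufferOverrides[i].BufferType);
}
```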
Tested with this configuration in BatchedExecutorSimple:

```csharp
parameters.GpuLayerCount = 99;
parameters.TensorBufferOverrides = new List<Abstractions.TensorBufferOverride>
{
    new(@"blk\.(2[6-9]|[3-4][0-9]).*", "CPU")
};
```

I used that because it sped up Qwen-3-30B-A3B by a factor of 10 on my machine (though the gain would likely be smaller for batching, since it's an MoE).
Thanks for all the work on this!
Based on the April_2025 branch. Should I rebase it?
Example usage:
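For instance, something along these lines (a sketch reconstructed from the description below and the configuration shown earlier in this thread; the exact pattern string is an assumption):

```csharp
// Assumed pattern: match any tensor whose name contains 'ffn'
// and keep it in the CPU buffer type.
parameters.TensorBufferOverrides = new List<Abstractions.TensorBufferOverride>
{
    new(".*ffn.*", "CPU")
};
```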
This would make all tensors with 'ffn' in their names offload to the CPU, with everything else on the GPU.