[Fix][Dlight] (Low-batched-)GeMV on small spatial loops #16775
Merged
This PR fixes an issue in the dlight GeMV rule and the low-batch GeMV rule. The issue occurs when the inner spatial loop is short (e.g., in the MoE gate layer, its length is usually 8).
The error arises because the GeMV scheduling does not ensure that every TIR block reads/writes the same number of local registers, and this inconsistency leads to incorrect generated code. For example, in the schedule prior to this fix, the first TIR block was scheduled to assign each thread 2 local registers, while the second block was scheduled to assign each thread 1 local register, which is incorrect. Unfortunately, this error only shows up when the spatial loop is short.
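To illustrate why a mismatch in per-thread register counts corrupts results, here is a toy model in plain Python. It is not the dlight scheduling code and uses no TVM APIs. The function `run_two_blocks`, the thread count of 4, and the index mapping `t * regs + r` are all assumptions made for illustration. Only the spatial length of 8 and the 2-versus-1 register counts come from the description above.

```python
def run_two_blocks(values, num_threads, regs_block1, regs_block2):
    """Toy model (hypothetical): two blocks share per-thread local registers."""
    n = len(values)

    # Block 1: thread t stores element t * regs_block1 + r in its register r.
    local = [[None] * regs_block1 for _ in range(num_threads)]
    for t in range(num_threads):
        for r in range(regs_block1):
            idx = t * regs_block1 + r
            if idx < n:
                local[t][r] = values[idx]

    # Block 2: thread t assumes register r holds element t * regs_block2 + r.
    out = [None] * n
    for t in range(num_threads):
        for r in range(min(regs_block2, regs_block1)):
            idx = t * regs_block2 + r
            if idx < n:
                out[idx] = local[t][r]
    return out


values = list(range(8))  # spatial loop of length 8, as in the MoE gate example

# Both blocks agree on 2 registers per thread: the data round-trips intact.
consistent = run_two_blocks(values, num_threads=4, regs_block1=2, regs_block2=2)

# Block 1 uses 2 registers per thread, block 2 uses 1: wrong elements are read.
mismatched = run_two_blocks(values, num_threads=4, regs_block1=2, regs_block2=1)

print(consistent)
print(mismatched)
```

In this sketch the consistent configuration reproduces the input, while the mismatched one reads the wrong elements and leaves half of the output unwritten. The fix described in this PR corresponds to making both blocks agree on the per-thread register count.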
One regression test is added.