# promote blocksparse from prototype, make it faster #1734
Might be good to consider getting the changes from #1690 in here; since you are making a major API change, it will save you a migration in the future.

Ah yes, that's a good idea. I'll open a subsequent PR and update all of the sparsity APIs.
This PR promotes block sparsity from prototype in torchao. Chiefly, it ports over the triton addmm blocksparse kernels from core and makes several performance improvements to them.

All of the numbers reported below are for an H100 with blocksize=64 and sparsity_level=0.9. The dense baseline is 134 tok/s.

1) Adds padding support to the triton kernel for dense matrices with dimension < 16, like those we run into during decoding. (214 -> 218 tok/s)
2) Changes the default [num_stages](triton-lang/triton#512) parameter from 1 to 4. This has a large effect on performance; the default kernel autotuning either does not modify this parameter or deems it unimportant for some reason. (218 -> 263 tok/s)
3) Adds an env var, BSR_AUTOTUNE, that users can set if they want kernel autotuning on top of the default parameters. (263 -> 266 tok/s) This seems to matter more for bs=n compute-bound workloads, where I see a reduction from 0.3855 s to 0.3745 s on bs=8192 prefill (roughly 3%).

So in total we are seeing a **1.985x** speedup 🚀

I've also updated the documentation to not reference prototype; planning on updating the diagram in a subsequent PR.

### Testing

I added a new test case for the padded inputs and moved the test file out of prototype.

```
python test/sparsity/test_sparse_api.py
```
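The kernels themselves live in triton, but the core idea behind block-sparse addmm can be sketched in plain Python: store only the nonzero blocks of the weight matrix and skip the zero blocks entirely during the multiply. The helper names below (`dense_to_blocks`, `block_sparse_mv`) are illustrative only, not torchao APIs.

```python
# Pure-Python sketch of the BSR (block sparse row) idea: keep only nonzero
# bs x bs blocks, and visit only those blocks during the matmul.
# Illustrative names, not the torchao/triton implementation.

def dense_to_blocks(A, bs):
    """Split an n x n matrix (list of lists) into bs x bs blocks,
    keeping only the blocks that contain at least one nonzero entry."""
    n = len(A)
    blocks = {}
    for bi in range(0, n, bs):
        for bj in range(0, n, bs):
            blk = [row[bj:bj + bs] for row in A[bi:bi + bs]]
            if any(v != 0 for r in blk for v in r):
                blocks[(bi // bs, bj // bs)] = blk
    return blocks

def block_sparse_mv(blocks, n, bs, x):
    """Compute y = A @ x, visiting only the stored (nonzero) blocks."""
    y = [0.0] * n
    for (bi, bj), blk in blocks.items():
        for r in range(bs):
            acc = 0.0
            for c in range(bs):
                acc += blk[r][c] * x[bj * bs + c]
            y[bi * bs + r] += acc
    return y

# At sparsity_level=0.9 roughly 90% of the blocks are zero, so ~90% of the
# inner-loop work disappears; the triton kernel applies the same skip to
# GPU tiles with blocksize=64.
A = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 5, 6],
    [0, 0, 7, 8],
]
blocks = dense_to_blocks(A, bs=2)
print(sorted(blocks))                              # only 2 of 4 blocks stored
print(block_sparse_mv(blocks, 4, 2, [1.0, 1.0, 1.0, 1.0]))
```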
### Benchmarking
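For reference, the headline speedup follows directly from the throughput figures quoted in the description (134 tok/s dense baseline, 266 tok/s with all three changes applied):

```python
# Per-step decode throughput from the PR description
# (tok/s on an H100, blocksize=64, sparsity_level=0.9).
dense_baseline = 134
steps = {
    "blocksparse + padding": 218,
    "num_stages=4": 263,
    "BSR_AUTOTUNE": 266,
}
speedup = steps["BSR_AUTOTUNE"] / dense_baseline
print(f"{speedup:.3f}x")  # 1.985x
```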