[XPU] Implemented 32bit optimizers in triton #1710

YangKai0616 · 2025-07-16T11:13:50Z

Depends on #1692.

Implemented 32-bit optimizers in Triton for use on XPU devices.

The PR includes two implementations:

Pure Torch implementation: utilizing torch.compile
Pure Triton implementation: utilizing triton.jit
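For reference, here is a minimal sketch of the kind of elementwise update a 32-bit optimizer step performs, using textbook Adam with bias correction. It is written in plain Python for illustration only; it is not the code from this PR, and the function name `adam_step` and its argument names are hypothetical. Weight decay and gradient clipping are left out.

```python
import math


def adam_step(params, grads, exp_avg, exp_avg_sq, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one textbook Adam update in place to flat lists of floats.

    `exp_avg` and `exp_avg_sq` are the first- and second-moment states,
    kept in full precision (hence "32-bit" state). `step` starts at 1.
    """
    bias1 = 1.0 - beta1 ** step  # bias correction for the first moment
    bias2 = 1.0 - beta2 ** step  # bias correction for the second moment
    for i, g in enumerate(grads):
        # Update the running moment estimates.
        exp_avg[i] = beta1 * exp_avg[i] + (1.0 - beta1) * g
        exp_avg_sq[i] = beta2 * exp_avg_sq[i] + (1.0 - beta2) * g * g
        # Bias-corrected estimates, then the parameter update.
        m_hat = exp_avg[i] / bias1
        v_hat = exp_avg_sq[i] / bias2
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

Each element is updated independently of the others, which is why a loop like this maps naturally onto either a compiled torch function or a Triton kernel that processes one block of elements per program.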

For the benchmarking on 4096*4096 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 1.075ms
BNB step: 0.516ms
Torch step (eager): 1.058ms
BNB step: 0.517ms
Torch step (eager): 1.080ms
BNB step: 0.527ms
Torch step (eager): 1.069ms
BNB step: 0.539ms
Torch step (eager): 1.034ms
BNB step: 0.526ms

Pure Triton implementation:

Torch step (eager): 1.034ms
BNB step: 0.524ms
Torch step (eager): 1.054ms
BNB step: 0.488ms
Torch step (eager): 1.031ms
BNB step: 0.526ms
Torch step (eager): 1.047ms
BNB step: 0.538ms
Torch step (eager): 1.045ms
BNB step: 0.489ms

For the benchmarking on 1024*1024 shapes, the results are as follows:

Pure Torch implementation:

Torch step (eager): 0.345ms
BNB step: 0.335ms
Torch step (eager): 0.354ms
BNB step: 0.226ms
Torch step (eager): 0.347ms
BNB step: 0.227ms
Torch step (eager): 0.358ms
BNB step: 0.232ms
Torch step (eager): 0.349ms
BNB step: 0.225ms

Pure Triton implementation:

Torch step (eager): 0.346ms
BNB step: 0.226ms
Torch step (eager): 0.337ms
BNB step: 0.216ms
Torch step (eager): 0.338ms
BNB step: 0.215ms
Torch step (eager): 0.333ms
BNB step: 0.226ms
Torch step (eager): 0.349ms
BNB step: 0.235ms
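To compare the four runs more easily, the timings listed above can be averaged. The sketch below copies the numbers from the listings; the dictionary layout and the names `RUNS`, `summarize`, and `SUMMARY` are illustrative choices, not part of the PR.

```python
from statistics import mean, median

# Timings in milliseconds, copied from the benchmark listings above.
# Keys are (shape, bitsandbytes implementation being measured).
RUNS = {
    ("4096*4096", "torch.compile"): {
        "torch_eager": [1.075, 1.058, 1.080, 1.069, 1.034],
        "bnb": [0.516, 0.517, 0.527, 0.539, 0.526],
    },
    ("4096*4096", "triton"): {
        "torch_eager": [1.034, 1.054, 1.031, 1.047, 1.045],
        "bnb": [0.524, 0.488, 0.526, 0.538, 0.489],
    },
    ("1024*1024", "torch.compile"): {
        "torch_eager": [0.345, 0.354, 0.347, 0.358, 0.349],
        "bnb": [0.335, 0.226, 0.227, 0.232, 0.225],
    },
    ("1024*1024", "triton"): {
        "torch_eager": [0.346, 0.337, 0.338, 0.333, 0.349],
        "bnb": [0.226, 0.216, 0.215, 0.226, 0.235],
    },
}


def summarize(runs):
    """Return mean/median step times and the eager-to-BNB speedup per run."""
    rows = {}
    for key, timings in runs.items():
        eager_mean = mean(timings["torch_eager"])
        bnb_mean = mean(timings["bnb"])
        rows[key] = {
            "eager_mean": eager_mean,
            "bnb_mean": bnb_mean,
            "bnb_median": median(timings["bnb"]),
            "speedup": eager_mean / bnb_mean,
        }
    return rows


SUMMARY = summarize(RUNS)

if __name__ == "__main__":
    for (shape, impl), row in SUMMARY.items():
        print(f"{shape} {impl}: eager {row['eager_mean']:.3f} ms, "
              f"BNB {row['bnb_mean']:.3f} ms, speedup {row['speedup']:.2f}x")
```

On these numbers the mean BNB step at 4096*4096 is about 0.525 ms with torch.compile and 0.513 ms with Triton, roughly a 2x speedup over eager torch in both cases. At 1024*1024 the 0.335 ms first sample pulls the torch.compile mean up to 0.249 ms; the median (0.227 ms) may be the fairer figure if that sample reflects warm-up, which the listing does not state.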

The test platform is an Intel(R) Data Center GPU Max 1550. The test script is the one referenced in #1692. "Torch step (eager)" is the 32-bit optimizer from torch; "BNB step" is the bitsandbytes 32-bit optimizer.
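The actual test script lives in #1692 and is not reproduced here. As a rough idea of how per-step timings like these can be collected, the following is a generic sketch; the helper name `time_step` and its parameters are hypothetical. On an accelerator, a device synchronization callable would need to be passed as `sync` so that queued asynchronous work is included in the measurement.

```python
import time


def time_step(step_fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock time of `step_fn` in milliseconds.

    `sync` is an optional callable that blocks until the device is idle.
    It defaults to a no-op, which is only appropriate for CPU work.
    """
    sync = sync or (lambda: None)
    # Warm-up iterations absorb one-time costs such as kernel compilation.
    for _ in range(warmup):
        step_fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    sync()  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / iters
```

A call might look like `time_step(optimizer.step, sync=device_sync)`, where `optimizer` and `device_sync` are placeholders for whatever the real script uses.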

The performance gap between the torch.compile and Triton implementations is not significant. However, the Triton implementation compiles faster, and #1692 was also implemented with Triton, so this PR submits the Triton version.

Note: XPU does not currently support allocating memory buffers through a paging mechanism, so the paged cases in tests/test_optim.py::test_optimizer32bit are skipped. This functionality will be developed in the future to support full optimizer capabilities.
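The skip condition could be expressed roughly as below. This is a sketch under assumptions: the helper name `should_skip_paged` is hypothetical, and the `paged_` prefix for paged optimizer names is assumed rather than taken from this PR's diff.

```python
def should_skip_paged(device: str, optim_name: str) -> bool:
    """Return True when a paged optimizer case should be skipped.

    Assumes paged optimizer test cases are identified by a "paged_"
    name prefix and that only the XPU backend lacks paged buffers.
    """
    return device == "xpu" and optim_name.startswith("paged_")
```

Inside a pytest test, a true result would typically lead to a `pytest.skip(...)` call with a message explaining that paged memory is not yet supported on XPU.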

bitsandbytes/_ops.py

bitsandbytes/backends/triton/kernels_optim.py

bitsandbytes/functional.py

jiqing-feng · 2025-07-23T02:15:48Z

Hi @matthewdouglas , would you please review this PR? Thanks!

yao-matrix · 2025-08-18T23:10:09Z

@matthewdouglas , could you pls help review this PR and #1692, thx very much

matthewdouglas

Looks good to me, thank you!

YangKai0616 changed the title ~~[XPU] Implemented 32bit optimizers in triton~~ [Draft][XPU] Implemented 32bit optimizers in triton Jul 16, 2025

YangKai0616 changed the title ~~[Draft][XPU] Implemented 32bit optimizers in triton~~ [XPU] Implemented 32bit optimizers in triton Jul 17, 2025

YangKai0616 marked this pull request as ready for review July 17, 2025 10:55

jiqing-feng reviewed Jul 18, 2025

bitsandbytes/_ops.py (outdated, resolved)

bitsandbytes/backends/triton/kernels_optim.py (outdated, resolved)

bitsandbytes/functional.py (outdated, resolved)

christoph-koehncke added the Intel and Optimizers labels Jul 29, 2025

matthewdouglas added this to the v0.49.0 milestone Sep 2, 2025

matthewdouglas modified the milestones: v0.49.0, v0.48.0 Sep 15, 2025

matthewdouglas approved these changes Sep 15, 2025

YangKai0616 and others added 5 commits September 15, 2025 10:28

Implemented 32bit optimizers in triton

b8a8a17

Modify Comments

5b784a3

Optimizing pure torch implementation

4e40c7f

Restore the order of parameters and modify the position of pure pytor…

06279af

…ch implementation

Restore files permissions

810e8cb

matthewdouglas force-pushed the 32bit_optimizer branch from 77cce6e to 810e8cb Compare September 15, 2025 14:28

matthewdouglas merged commit 275671b into bitsandbytes-foundation:main Sep 15, 2025

6 participants
