Fix broken group offloading with block_level for models with standalone layers #12692
Conversation
@vladmandic would you be interested in testing this out a bit?
@sayakpaul this patch should also fix #12096 since both have the same root cause (standalone conv layers not tracked in block-level offloading), and this handles both Conv2d (SDXL) and Conv3d (Wan). The fix should work for WanVACEPipeline as well.
Very cool! Feel free to add some lightweight tests to this PR and share the final outputs so that we can also test ourselves.
Yes, I added tests in this PR. I also created a standalone test script that validates the SDXL VAE and AutoencoderKLWan with both block_level and leaf_level offloading. The output of the script and the pytest output are linked.
Thanks for the comprehensive testing! I meant to ask for an even more minimal test script that utilizes group offloading with:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("...", torch_dtype=torch.bfloat16)
pipe.transformer.enable_group_offloading(...)
# move rest of the components to CUDA
...
# inference
pipe(...)
```
sayakpaul left a comment
Thanks for getting started on the PR.
```python
# Do NOT add the container name to modules_with_group_offloading here, because we need
# parameters from non-computational sublayers (like GroupNorm) to be gathered
```
Could you expand a bit more on this?
```python
            cumulated_absmax, 1e-5, f"Output differences for {name} exceeded threshold: {cumulated_absmax:.5f}"
        )

    def test_vae_like_model_with_standalone_conv_layers(self):
```
We can leverage existing model implementations that were reported to be problematic, with small configs, and use them for testing here.
```python
        if torch.device(torch_device).type not in ["cuda", "xpu"]:
            return

        model = DummyVAELikeModel(in_features=64, hidden_features=128, out_features=64)
```
Prefer using much smaller configs for tests.
```python
        x = torch.randn(2, 64).to(torch_device)

        with torch.no_grad():
            for i in range(5):
```
(nit): we can reduce the iteration number to 2.
```python
        x = torch.randn(2, 64).to(torch_device)

        with torch.no_grad():
```
Why test here without iterations, unlike the tests below?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
Fixes #12319
This PR fixes the device mismatch error that occurs when using `block_level` group offloading with models containing standalone computational layers (like the VAE's `post_quant_conv` and `quant_conv`).

Problem
When using `block_level` offloading, the implementation only matched `ModuleList` and `Sequential` containers, leaving standalone layers (like `Conv2d`) unmanaged. These layers remained on CPU while their inputs were on CUDA, causing a device mismatch error at inference time.
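For context, a minimal sketch of the failing pattern, assuming a CUDA device and a recent diffusers release that exposes `apply_group_offloading`; the tiny randomly initialized `AutoencoderKL` config here is only illustrative:

```python
import torch
from diffusers import AutoencoderKL
from diffusers.hooks import apply_group_offloading

# Tiny VAE; like the SDXL VAE, it has standalone quant_conv / post_quant_conv layers.
vae = AutoencoderKL(block_out_channels=(32,), norm_num_groups=8)

apply_group_offloading(
    vae,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

x = torch.randn(1, 3, 32, 32, device="cuda")
with torch.no_grad():
    # Before this fix, the standalone conv layers stayed on CPU while the
    # input was on CUDA, which raised a device mismatch error here.
    out = vae(x).sample
```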
The block-level offloading logic in `group_offloading.py` only looked for `ModuleList` and `Sequential` containers when creating groups. Standalone computational layers like the VAE's `post_quant_conv` (a `Conv2d` layer) were not included in any group, so they never received hooks to manage their device placement. This caused them to remain on CPU while their inputs were transferred to CUDA.

The block-level offloading logic has been modified to:
- continue creating groups from `ModuleList`/`Sequential` containers
- collect standalone computational layers that fall outside these containers into an `unmatched_group` that gets proper hook management (see the sketch below)
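A rough, hypothetical illustration of the grouping idea (not the actual diffusers implementation; `SUPPORTED_LAYERS` stands in for the real `_GO_LC_SUPPORTED_PYTORCH_LAYERS` constant):

```python
import torch.nn as nn

# Stand-in for _GO_LC_SUPPORTED_PYTORCH_LAYERS
SUPPORTED_LAYERS = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)


def split_into_groups(model: nn.Module):
    matched_groups, unmatched_layers = [], []
    for name, child in model.named_children():
        if isinstance(child, (nn.ModuleList, nn.Sequential)):
            # containers keep being grouped block by block, as before
            matched_groups.append((name, list(child)))
        elif isinstance(child, SUPPORTED_LAYERS):
            # standalone computational layers (e.g. a VAE's post_quant_conv)
            # go into one extra group that also gets onload/offload hooks
            unmatched_layers.append((name, child))
    return matched_groups, unmatched_layers
```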
Key Changes:

- The `_create_groups_for_block_level_offloading()` function collects unmatched computational layers
- Computational layers are identified using the `_GO_LC_SUPPORTED_PYTORCH_LAYERS` filter

Testing
Tested with:

- SDXL VAE (`AutoencoderKL`), which has standalone `post_quant_conv` and `quant_conv` layers

Test Coverage:
- `test_group_offloading_models_with_standalone_and_deeply_nested_layers` - verifies the fix works with complex model architectures (a simplified version is sketched below)
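A simplified, hypothetical version of what such a test can look like (the toy model and assertions below are illustrative, not the exact code added in this PR; it assumes a CUDA device):

```python
import torch
import torch.nn as nn

from diffusers.hooks import apply_group_offloading


class TinyModelWithStandaloneConv(nn.Module):
    """Toy model: nested blocks plus a standalone conv outside any container."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(2)])
        self.post_quant_conv = nn.Conv2d(4, 4, 1)  # standalone layer

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.post_quant_conv(x)


def test_block_level_offloading_with_standalone_conv():
    model = TinyModelWithStandaloneConv()
    x = torch.randn(1, 4, 8, 8, device="cuda")

    # Reference output with everything on the GPU.
    with torch.no_grad():
        expected = model.to("cuda")(x)
    model.to("cpu")

    apply_group_offloading(
        model,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="block_level",
        num_blocks_per_group=1,
    )
    with torch.no_grad():
        out = model(x)  # raised a device mismatch error before the fix

    assert torch.allclose(out, expected, atol=1e-5)
```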
Expected Behavior After Fix

Before: block-level offloading fails with a device mismatch error when models have standalone computational layers.
After: block-level offloading works correctly with all model architectures, including those with:

- standalone `Conv2d`, `Linear`, and other computational layers
- layers nested in `ModuleList`/`Sequential` containers

Before submitting
Broken group offloading using block_level #12319 (comment)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul this closes #12319