
Conversation

@rycerzes

What does this PR do?

Fixes #12319

This PR fixes the device mismatch error that occurs when using block_level group offloading with models containing standalone computational layers (like VAE's post_quant_conv and quant_conv).

Problem

When block_level offloading is enabled, standalone layers (such as the VAE's Conv2d layers) are left unmanaged: they remain on CPU while their inputs are on CUDA, causing:

RuntimeError: Input type (CUDABFloat16Type) and weight type (CPUBFloat16Type) should be the same

The root cause is that the block-level offloading logic in group_offloading.py only looks for ModuleList and Sequential containers when creating groups. Standalone computational layers like the VAE's post_quant_conv (a Conv2d layer) are not included in any group, so they never receive the hooks that manage device placement, and they stay on CPU while their inputs are transferred to CUDA.
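
For reference, a minimal repro sketch (the stabilityai/sdxl-vae checkpoint, dtype, and enable_group_offload arguments below are illustrative assumptions, not part of this PR):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.bfloat16)
vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

# before this fix, post_quant_conv receives no offloading hook and stays on CPU
# while the latents are on CUDA, so decode() raises the error shown above
latents = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    image = vae.decode(latents).sample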

The block-level offloading logic has been modified to:

  1. Identify standalone computational layers that don't belong to any ModuleList/Sequential container
  2. Group them into a dedicated unmatched_group
  3. Apply the usual onload/offload hooks to that group so these layers land on the execution device during the forward pass

Key Changes:

  • Updated _create_groups_for_block_level_offloading() to collect unmatched computational layers (see the sketch after this list)
  • Added logic to create a group for standalone layers using the same _GO_LC_SUPPORTED_PYTORCH_LAYERS filter
  • Ensured the unmatched group is properly integrated into the hook chain
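
The following is a minimal, self-contained sketch of the grouping idea only, not the actual diffusers code: SUPPORTED_LAYERS stands in for _GO_LC_SUPPORTED_PYTORCH_LAYERS, split_into_groups stands in for _create_groups_for_block_level_offloading(), and all hook/group bookkeeping is omitted.

from torch import nn

# stand-in for diffusers' _GO_LC_SUPPORTED_PYTORCH_LAYERS filter
SUPPORTED_LAYERS = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)

def split_into_groups(model: nn.Module):
    """Collect ModuleList/Sequential children as block groups and gather the
    remaining computational layers into one extra 'unmatched' group."""
    block_groups, unmatched = [], []
    for name, child in model.named_children():
        if isinstance(child, (nn.ModuleList, nn.Sequential)):
            block_groups.append((name, list(child)))  # handled as before
        elif isinstance(child, SUPPORTED_LAYERS):
            unmatched.append(child)  # e.g. post_quant_conv, quant_conv
    if unmatched:
        block_groups.append(("unmatched_group", unmatched))
    return block_groups

In the PR itself, the unmatched group is then registered with the same onload/offload hooks as the block groups.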

Testing

Tested with:

  • SDXL VAE (AutoencoderKL), which has standalone post_quant_conv and quant_conv layers
  • Test cases for models with both standalone and deeply nested layer structures
  • Both streaming and non-streaming modes, confirmed to work correctly

Test Coverage:

  • test_group_offloading_models_with_standalone_and_deeply_nested_layers - Verifies the fix works with complex model architectures (a simplified sketch of this kind of test follows the list)
  • All existing group offloading tests continue to pass
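
A simplified sketch of this kind of test (TinyVAELike is a made-up toy model, and the apply_group_offloading arguments follow its documented block_level usage; treat the exact names as assumptions rather than the test code in this PR):

import torch
from torch import nn
from diffusers.hooks import apply_group_offloading

class TinyVAELike(nn.Module):
    # standalone convs around a container, mirroring AutoencoderKL's
    # quant_conv / post_quant_conv layout
    def __init__(self):
        super().__init__()
        self.quant_conv = nn.Conv2d(4, 4, 1)
        self.blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(2)])
        self.post_quant_conv = nn.Conv2d(4, 4, 1)

    def forward(self, x):
        x = self.quant_conv(x)
        for block in self.blocks:
            x = block(x)
        return self.post_quant_conv(x)

model = TinyVAELike()
apply_group_offloading(
    model,
    onload_device=torch.device("cuda"),  # requires an accelerator device
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

with torch.no_grad():
    out = model(torch.randn(1, 4, 8, 8, device="cuda"))
assert out.shape == (1, 4, 8, 8)

Before the fix, the standalone quant_conv and post_quant_conv here would stay on CPU and the forward pass would raise the device-mismatch error; with the fix they get hooks like every other group.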

Expected Behavior After Fix

Before: Block-level offloading fails with device mismatch error when models have standalone computational layers

After: Block-level offloading works correctly with all model architectures, including those with:

  • Standalone Conv2d, Linear, and other computational layers
  • Nested ModuleList/Sequential containers
  • Mixed architectures with both standalone and containerized layers

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul this closes #12319

@sayakpaul
Member

@vladmandic would you be interested in testing this out a bit?

@rycerzes
Author

@sayakpaul this patch should also fix #12096 since both have the same root cause (standalone conv layers not tracked in block-level offloading), and this handles both Conv2d (SDXL) and Conv3d (Wan).

The fix should work for WanVACEPipeline as well.
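
A hedged sanity check for the Wan case, assuming the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint and a plausible latent layout; the only point is that AutoencoderKLWan's standalone Conv3d layers now receive hooks as well:

import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

# latent layout (B, C, T, H, W) with 16 latent channels is an assumption
latents = torch.randn(1, 16, 1, 32, 32, device="cuda")
with torch.no_grad():
    video = vae.decode(latents).sample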

@sayakpaul
Member

Very cool! Feel free to add some lightweight tests to this PR and the final outputs so that we can also test ourselves.

@sayakpaul requested a review from DN6 on November 21, 2025 at 07:06
@rycerzes
Author

> Very cool! Feel free to add some lightweight tests to this PR and the final outputs so that we can also test ourselves.

Yes, I added tests in test_group_offloading.py covering the core fix (test_block_level_stream_with_invocation_order_different_from_initialization_order) plus edge cases for VAE-like models with standalone layers, deeply nested structures, and parameter-only modules.

I also created a standalone test script that validates SDXL VAE and AutoencoderKLWan with both block_level and leaf_level offloading. Output of the script.

Pytest output

pytest tests/hooks/test_group_offloading.py -v
==================================================================== test session starts =====================================================================
platform win32 -- Python 3.13.3, pytest-9.0.1, pluggy-1.6.0 -- D:\Github\oss\diffusers\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\Github\oss\diffusers
configfile: pyproject.toml
plugins: anyio-4.11.0, timeout-2.4.0, xdist-3.8.0, requests-mock-1.10.0
collected 20 items                                                                                                                                            

tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_offloading_with_parameter_only_module_group_0_block_level PASSED              [  5%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_offloading_with_parameter_only_module_group_1_leaf_level PASSED               [ 10%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_stream_with_invocation_order_different_from_initialization_order PASSED       [ 15%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_group_offloading_applied_on_model_offloaded_module PASSED                 [ 20%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_group_offloading_applied_on_sequential_offloaded_module PASSED            [ 25%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_model_offloading_applied_on_group_offloaded_module PASSED                 [ 30%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_sequential_offloading_applied_on_group_offloaded_module PASSED            [ 35%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_streams_used_and_no_accelerator_device PASSED                             [ 40%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_supports_group_offloading_false PASSED                                    [ 45%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_model_with_deeply_nested_blocks PASSED                                                    [ 50%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_model_with_only_standalone_layers PASSED                                                  [ 55%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_multiple_invocations_with_vae_like_model PASSED                                           [ 60%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_nested_container_parameters_offloading PASSED                                             [ 65%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_offloading_forward_pass PASSED                                                            [ 70%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_standalone_conv_layers_with_both_offload_types_0_block_level PASSED                       [ 75%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_standalone_conv_layers_with_both_offload_types_1_leaf_level PASSED                        [ 80%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_vae_like_model_with_standalone_conv_layers PASSED                                         [ 85%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_vae_like_model_without_streams PASSED                                                     [ 90%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_warning_logged_if_group_offloaded_module_moved_to_accelerator PASSED                      [ 95%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_warning_logged_if_group_offloaded_pipe_moved_to_accelerator PASSED                        [100%]

===================================================================== 20 passed in 4.34s =====================================================================

@sayakpaul
Member

Thanks for the comprehensive testing! I meant to ask for an even more minimal test script that utilizes group offloading with block_level and generates an output as expected. Something like:

from diffusers import DiffusionPipeline
import torch 

pipe = DiffusionPipeline.from_pretrained("...", torch_dtype=torch.bfloat16)
pipe.transformer.enable_group_offload(...)

# move rest of the components to CUDA
...

# inference
pipe(...)

@sayakpaul left a comment (Member)


Thanks for getting started on the PR.

Comment on lines +608 to +609
# Do NOT add the container name to modules_with_group_offloading here, because we need
# parameters from non-computational sublayers (like GroupNorm) to be gathered

Could you expand a bit more on this?

cumulated_absmax, 1e-5, f"Output differences for {name} exceeded threshold: {cumulated_absmax:.5f}"
)

def test_vae_like_model_with_standalone_conv_layers(self):

We can leverage existing model implementations that were reported to be problematic, with small configs, and use them for testing here.

if torch.device(torch_device).type not in ["cuda", "xpu"]:
    return

model = DummyVAELikeModel(in_features=64, hidden_features=128, out_features=64)

Prefer using much smaller configs for tests.

x = torch.randn(2, 64).to(torch_device)

with torch.no_grad():
    for i in range(5):

(nit): we can reduce the iteration number to 2.


x = torch.randn(2, 64).to(torch_device)

with torch.no_grad():

Why is this tested here without iterations, unlike the tests below?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.



Development

Successfully merging this pull request may close these issues.

Broken group offloading using block_level
