
Conversation

@rycerzes

What does this PR do?

Fixes #12319

This PR fixes the device mismatch error that occurs when using block_level group offloading with models containing standalone computational layers (like VAE's post_quant_conv and quant_conv).

Problem

When block_level offloading is enabled, standalone layers (such as the VAE's Conv2d layers) are left unmanaged: they remain on CPU while their inputs are on CUDA, causing:

RuntimeError: Input type (CUDABFloat16Type) and weight type (CPUBFloat16Type) should be the same

The root cause is that the block-level offloading logic in group_offloading.py only looks for ModuleList and Sequential containers when creating groups. Standalone computational layers like the VAE's post_quant_conv (a Conv2d layer) are not included in any group, so they never receive the hooks that manage device placement, and they stay on CPU while their inputs are transferred to CUDA.
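
For reference, a minimal repro sketch (the stabilityai/sdxl-vae checkpoint, dtype, and enable_group_offload arguments below are illustrative assumptions, not part of this PR):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.bfloat16)
vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

# before this fix, post_quant_conv receives no offloading hook and stays on CPU
# while the latents are on CUDA, so decode() raises the error shown above
latents = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    image = vae.decode(latents).sample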

The block-level offloading logic has been modified to:

  1. Identify standalone computational layers that don't belong to any ModuleList/Sequential container
  2. Group them into a dedicated unmatched_group
  3. Apply the usual onload/offload hooks to that group so these layers land on the execution device during the forward pass

Key Changes:

  • Updated _create_groups_for_block_level_offloading() to collect unmatched computational layers (see the sketch after this list)
  • Added logic to create a group for standalone layers using the same _GO_LC_SUPPORTED_PYTORCH_LAYERS filter
  • Ensured the unmatched group is properly integrated into the hook chain
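
The following is a minimal, self-contained sketch of the grouping idea only, not the actual diffusers code: SUPPORTED_LAYERS stands in for _GO_LC_SUPPORTED_PYTORCH_LAYERS, split_into_groups stands in for _create_groups_for_block_level_offloading(), and all hook/group bookkeeping is omitted.

from torch import nn

# stand-in for diffusers' _GO_LC_SUPPORTED_PYTORCH_LAYERS filter
SUPPORTED_LAYERS = (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)

def split_into_groups(model: nn.Module):
    """Collect ModuleList/Sequential children as block groups and gather the
    remaining computational layers into one extra 'unmatched' group."""
    block_groups, unmatched = [], []
    for name, child in model.named_children():
        if isinstance(child, (nn.ModuleList, nn.Sequential)):
            block_groups.append((name, list(child)))  # handled as before
        elif isinstance(child, SUPPORTED_LAYERS):
            unmatched.append(child)  # e.g. post_quant_conv, quant_conv
    if unmatched:
        block_groups.append(("unmatched_group", unmatched))
    return block_groups

In the PR itself, the unmatched group is then registered with the same onload/offload hooks as the block groups.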

Testing

Tested with:

  • SDXL VAE (AutoencoderKL), which has standalone post_quant_conv and quant_conv layers
  • Test cases for models with both standalone and deeply nested layer structures
  • Both streaming and non-streaming modes, confirmed to work correctly

Test Coverage:

  • test_group_offloading_models_with_standalone_and_deeply_nested_layers - Verifies the fix works with complex model architectures (a simplified sketch of this kind of test follows the list)
  • All existing group offloading tests continue to pass
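
A simplified sketch of this kind of test (TinyVAELike is a made-up toy model, and the apply_group_offloading arguments follow its documented block_level usage; treat the exact names as assumptions rather than the test code in this PR):

import torch
from torch import nn
from diffusers.hooks import apply_group_offloading

class TinyVAELike(nn.Module):
    # standalone convs around a container, mirroring AutoencoderKL's
    # quant_conv / post_quant_conv layout
    def __init__(self):
        super().__init__()
        self.quant_conv = nn.Conv2d(4, 4, 1)
        self.blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(2)])
        self.post_quant_conv = nn.Conv2d(4, 4, 1)

    def forward(self, x):
        x = self.quant_conv(x)
        for block in self.blocks:
            x = block(x)
        return self.post_quant_conv(x)

model = TinyVAELike()
apply_group_offloading(
    model,
    onload_device=torch.device("cuda"),  # requires an accelerator device
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

with torch.no_grad():
    out = model(torch.randn(1, 4, 8, 8, device="cuda"))
assert out.shape == (1, 4, 8, 8)

Before the fix, the standalone quant_conv and post_quant_conv here would stay on CPU and the forward pass would raise the device-mismatch error; with the fix they get hooks like every other group.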

Expected Behavior After Fix

Before: Block-level offloading fails with device mismatch error when models have standalone computational layers

After: Block-level offloading works correctly with all model architectures, including those with:

  • Standalone Conv2d, Linear, and other computational layers
  • Nested ModuleList/Sequential containers
  • Mixed architectures with both standalone and containerized layers

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul this closes #12319

@sayakpaul
Member

@vladmandic would you be interested in testing this out a bit?

@rycerzes
Author

@sayakpaul this patch should also fix #12096 since both have the same root cause (standalone conv layers not tracked in block-level offloading), and this handles both Conv2d (SDXL) and Conv3d (Wan).

The fix should work for WanVACEPipeline as well.
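
A hedged sanity check for the Wan case, assuming the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint and a plausible latent layout; the only point is that AutoencoderKLWan's standalone Conv3d layers now receive hooks as well:

import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
vae.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
)

# latent layout (B, C, T, H, W) with 16 latent channels is an assumption
latents = torch.randn(1, 16, 1, 32, 32, device="cuda")
with torch.no_grad():
    video = vae.decode(latents).sample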

@sayakpaul
Member

Very cool! Feel free to add some lightweight tests to this PR and the final outputs so that we can also test ourselves.

@sayakpaul requested a review from DN6 on November 21, 2025 at 07:06
@rycerzes
Author

> Very cool! Feel free to add some lightweight tests to this PR and the final outputs so that we can also test ourselves.

Yes, I added tests in test_group_offloading.py covering the core fix (test_block_level_stream_with_invocation_order_different_from_initialization_order) plus edge cases for VAE-like models with standalone layers, deeply nested structures, and parameter-only modules.

I also created a standalone test script that validates SDXL VAE and AutoencoderKLWan with both block_level and leaf_level offloading. Output of the script.

Pytest output

pytest tests/hooks/test_group_offloading.py -v
==================================================================== test session starts =====================================================================
platform win32 -- Python 3.13.3, pytest-9.0.1, pluggy-1.6.0 -- D:\Github\oss\diffusers\.venv\Scripts\python.exe
cachedir: .pytest_cache
rootdir: D:\Github\oss\diffusers
configfile: pyproject.toml
plugins: anyio-4.11.0, timeout-2.4.0, xdist-3.8.0, requests-mock-1.10.0
collected 20 items                                                                                                                                            

tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_offloading_with_parameter_only_module_group_0_block_level PASSED              [  5%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_offloading_with_parameter_only_module_group_1_leaf_level PASSED               [ 10%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_block_level_stream_with_invocation_order_different_from_initialization_order PASSED       [ 15%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_group_offloading_applied_on_model_offloaded_module PASSED                 [ 20%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_group_offloading_applied_on_sequential_offloaded_module PASSED            [ 25%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_model_offloading_applied_on_group_offloaded_module PASSED                 [ 30%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_sequential_offloading_applied_on_group_offloaded_module PASSED            [ 35%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_streams_used_and_no_accelerator_device PASSED                             [ 40%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_error_raised_if_supports_group_offloading_false PASSED                                    [ 45%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_model_with_deeply_nested_blocks PASSED                                                    [ 50%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_model_with_only_standalone_layers PASSED                                                  [ 55%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_multiple_invocations_with_vae_like_model PASSED                                           [ 60%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_nested_container_parameters_offloading PASSED                                             [ 65%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_offloading_forward_pass PASSED                                                            [ 70%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_standalone_conv_layers_with_both_offload_types_0_block_level PASSED                       [ 75%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_standalone_conv_layers_with_both_offload_types_1_leaf_level PASSED                        [ 80%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_vae_like_model_with_standalone_conv_layers PASSED                                         [ 85%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_vae_like_model_without_streams PASSED                                                     [ 90%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_warning_logged_if_group_offloaded_module_moved_to_accelerator PASSED                      [ 95%]
tests/hooks/test_group_offloading.py::GroupOffloadTests::test_warning_logged_if_group_offloaded_pipe_moved_to_accelerator PASSED                        [100%]

===================================================================== 20 passed in 4.34s =====================================================================

@sayakpaul
Member

Thanks for the comprehensive testing! I meant to ask for an even more minimal test script that utilizes group offloading with block_level and generates an output as expected. Something like:

from diffusers import DiffusionPipeline
import torch 

pipe = DiffusionPipeline.from_pretrained("...", torch_dtype=torch.bfloat16)
pipe.transformer.enable_group_offload(...)

# move rest of the components to CUDA
...

# inference
pipe(...)

@sayakpaul left a comment (Member)


Thanks for getting started on the PR.

Comment on lines +608 to +609
# Do NOT add the container name to modules_with_group_offloading here, because we need
# parameters from non-computational sublayers (like GroupNorm) to be gathered

Could you expand a bit more on this?

cumulated_absmax, 1e-5, f"Output differences for {name} exceeded threshold: {cumulated_absmax:.5f}"
)

def test_vae_like_model_with_standalone_conv_layers(self):

We can leverage existing model implementations that were reported to be problematic, with small configs, and use them for testing here.

if torch.device(torch_device).type not in ["cuda", "xpu"]:
    return

model = DummyVAELikeModel(in_features=64, hidden_features=128, out_features=64)

Prefer using much smaller configs for tests.

x = torch.randn(2, 64).to(torch_device)

with torch.no_grad():
    for i in range(5):

(nit): we can reduce the iteration number to 2.


x = torch.randn(2, 64).to(torch_device)

with torch.no_grad():

Why is this tested here without iterations, unlike the tests below?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.



Development

Successfully merging this pull request may close these issues.

Broken group offloading using block_level
