Detect and fix most _init_weights() issues - make it work for composite models #37070
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
Before a huge refactor, could you review this one so that hopefully we can merge it? 🙏
Force-pushed from 8574b67 to 759c5c8
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from c84f00b to 39ddd6e
cc @ArthurZucker, I highlighted the most important changes. The other changes are just the _init_weights fixes on all the models.
@torch.no_grad()
def initialize_weights(self):
    """
    This is equivalent to calling `self.apply(self._initialize_weights)`, but correctly handles composite models.
    This function dynamically dispatches the correct `init_weights` function to the modules as we advance in the
    module graph along the recursion. It can handle an arbitrary number of sub-models. Without it, every composite
    model would have to recurse a second time on all sub-models explicitly in the outer-most `_init_weights`, which
    is extremely error prone and inefficient.

    Note that the `torch.no_grad()` decorator is very important as well, as most of our `_init_weights` do not use
    `torch.nn.init` functions (which are all no_grad by default), but simply do in-place ops such as
    `module.weight.data.zero_()`.
    """
    if not hasattr(torch.nn.Module, "smart_apply"):
        # This function is equivalent to `torch.nn.Module.apply`, except that it dynamically adjusts the function
        # to apply as we go down the graph
        def smart_apply(self, fn):
            for module in self.children():
                # We found a sub-model: recursively dispatch its own init function now!
                if hasattr(module, "_init_weights"):
                    module.smart_apply(module._initialize_weights)
                else:
                    module.smart_apply(fn)
            fn(self)
            return self

        torch.nn.Module.smart_apply = smart_apply

    # Let the magic happen with this simple call
    self.smart_apply(self._initialize_weights)
This is the most important change to review @ArthurZucker. It's the most efficient and elegant way to handle it, as we only need to traverse the modules once. However, it requires hot-patching torch.nn.Module, which is a bummer but fine IMO.
The other options that avoid doing so all require traversing the modules several times (at least twice), which is less efficient.
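For context, a rough sketch of what such a multi-pass alternative could look like (illustrative only, not code from this PR; it is a method fragment assuming the same PreTrainedModel context as initialize_weights above): the outer model first applies its own init everywhere, then walks the module tree again for every sub-model that defines its own _init_weights.

def initialize_weights_naive(self):
    # Pass 1: apply the outer-most model's init to every module
    self.apply(self._initialize_weights)
    # Pass 2+: traverse again and re-apply each sub-model's own init over its subtree
    for module in self.modules():
        if module is not self and hasattr(module, "_init_weights"):
            module.apply(module._initialize_weights)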
tests/test_modeling_common.py (outdated)
filename = inspect.getfile(model_class)
# No easy way to get the model addition date -> check the copyright year at the top of the file
with open(filename) as file:
    source_code = file.read()
addition_year = 0  # if we cannot find it, set it to 0 (i.e. oldest)
if match_object := re.search(r"^# Copyright (\d{4})", source_code, re.MULTILINE | re.IGNORECASE):
    addition_year = int(match_object.group(1))

# For now, skip everything that is both older than 2025 and not an "important" model (too many models to patch otherwise)
# Use `_supports_cache_class` as a proxy to judge "important" models in order to prioritize them
# TODO: relax this as we patch more and more models
if addition_year < 2025 and not model_class._supports_cache_class:
    self.skipTest(reason=f"{model_class} is not a prioritized model for now.")
Because there are too many models to patch otherwise, I'm only enforcing it for the most recent/most important ones for now. Having the test like this will enforce that new models have correct init schemes while we patch the older ones. I'm just using the copyright date as a proxy for the addition date, as it's the easiest way that came to mind.
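For illustration, a tiny standalone example of that copyright-year heuristic on a made-up file header (the header string below is invented, not taken from an actual model file):

import re

# Made-up header standing in for the top of a modeling file
source_code = "# coding=utf-8\n# Copyright 2025 The HuggingFace Inc. team.\n"

addition_year = 0  # fallback when no copyright line is found
if match := re.search(r"^# Copyright (\d{4})", source_code, re.MULTILINE | re.IGNORECASE):
    addition_year = int(match.group(1))

print(addition_year)  # -> 2025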
Force-pushed from 13ee1da to 8e55a6f
Force-pushed from 6ef2c6d to e5d5ecc
Force-pushed from bc884ae to ce665b8
run-slow: llama, mistral, mistral3, qwen2_5_vl
This comment contains run-slow, running the specified jobs: models: ['models/llama', 'models/mistral', 'models/mistral3', 'models/qwen2_5_vl']
Slow tests are similar to main; the other failures are hub timeouts and a flaky test for Qwen2.5 Omni.
LGTM, not sure how much more efficient it is, but it should be a lot! 🤗
What does this PR do?
This is a follow-up of #36963.
This PR makes _init_weights work seamlessly with composite models. Until this point, composite models would only use the _init_weights of the outer-most PreTrainedModel wrapper, leading to errors or skipped modules. Now, sub-models are correctly initialized according to their own _init_weights, without any overhead. This is increasingly important as most recent models are now multimodal.

Without this change, every composite model would have to recurse a second time on all sub-models explicitly in the outer-most _init_weights, which is extremely error prone and inefficient. E.g., we would need to do one or the other of the following in the outer-most _init_weights:
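A rough sketch of what that could look like, using made-up class names (MyVisionModel, MyCompositeModel) and made-up init schemes rather than actual library code:

import torch.nn as nn


class MyVisionModel(nn.Module):
    # Stand-in sub-model with its own init scheme (illustrative only)
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, 16, kernel_size=4)

    def _init_weights(self, module):
        if isinstance(module, nn.Conv2d):
            module.weight.data.normal_(mean=0.0, std=0.01)


class MyCompositeModel(nn.Module):
    # Stand-in outer model: its _init_weights has to know about every sub-model
    def __init__(self):
        super().__init__()
        self.vision_model = MyVisionModel()
        self.lm_head = nn.Linear(16, 100)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
        # Option 1: when the traversal reaches the sub-model, re-apply its own init
        # over its whole subtree (an extra traversal of that subtree)
        elif isinstance(module, MyVisionModel):
            module.apply(module._init_weights)
        # Option 2 (not shown): duplicate the sub-model's init logic inline here,
        # module type by module type


model = MyCompositeModel()
model.apply(model._init_weights)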
This PR allows each model to simply write its own _init_weights for its own modules and have all submodels correctly initialized automatically (see the sketch below).
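Continuing the toy example above (same made-up names; not the PR's original snippet), the outer-most _init_weights can now cover only its own direct modules, and the sub-model's scheme is dispatched automatically by initialize_weights:

class MyCompositeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = MyVisionModel()  # initialized by its own _init_weights
        self.lm_head = nn.Linear(16, 100)

    def _init_weights(self, module):
        # Only this model's own modules; no explicit recursion into sub-models
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)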
Also, enforce torch.no_grad() for initialization, which was not the case before and would slow down the process.
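A minimal standalone illustration of the no_grad point (not from the PR itself): under torch.no_grad(), in-place ops directly on parameters are allowed and not tracked by autograd; without it, the first in-place call below would raise a RuntimeError.

import torch
import torch.nn as nn

layer = nn.Linear(8, 8)

with torch.no_grad():
    layer.weight.zero_()   # in-place op directly on the Parameter: fine under no_grad
    layer.bias.fill_(0.0)  # no autograd tracking, no graph overhead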
Finally, fix the _init_weights of a LOT of models, the most important ones (the most recent ones, and the ones with the flag _supports_cache_class=True) for now. The reason not to do them all is simply that there are too many to fix; almost all models in the library have broken _init_weights 🙃. We'll patch incrementally. In the meantime, the added test will enforce that new models are correct.