
Conversation

@younesbelkada (Contributor) commented May 31, 2024

What does this PR do?

Fixes: #30523

Click to see the snippet (make sure to run `accelerate config` and select FSDP options beforehand, then run `accelerate launch script.py`)
from functools import partial
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

# verify we have FSDP activation support ready by importing:
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
    CheckpointImpl,
    apply_activation_checkpointing,
)

from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model_id = "HuggingFaceM4/tiny-random-Llama3ForCausalLM"

model = AutoModelForCausalLM.from_pretrained(model_id)

model.train()
model.gradient_checkpointing_enable()

accelerator = Accelerator()
model = accelerator.prepare(model)

# only wrap the decoder layers with activation checkpointing
check_fn = lambda submodule: isinstance(submodule, LlamaDecoderLayer)

non_reentrant_wrapper = partial(
    checkpoint_wrapper,
    offload_to_cpu=False,
    checkpoint_impl=CheckpointImpl.NO_REENTRANT,
)

apply_activation_checkpointing(
    model, checkpoint_wrapper_fn=non_reentrant_wrapper, check_fn=check_fn
)

print(model)

# dummy input ids on GPU 0
rand_input = torch.LongTensor([[0, 1, 0, 1]]).to(0)

model(rand_input)

#30743 introduced a breaking change for users that combine Llama-based models with FSDP and activation checkpointing.

Before #30743, we were able to pass arbitrary kwargs to Llama modules and they were silently ignored. When doing FSDP + activation checkpointing, the target gradient checkpointing classes are wrapped in a new class, and additional kwargs are passed along to that class's forward pass.

The script above used to work for transformers <= 4.40.0 and does not work anymore due to #30743; re-introducing **kwargs in the forward method signature fixes the bug.
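
For illustration, here is a minimal sketch of the kind of signature change described above (the class and argument names are indicative assumptions, not the exact diff in this PR):

import torch

# Sketch only: a decoder-layer-style module whose forward accepts and ignores
# extra keyword arguments, so wrapper classes (such as the FSDP activation
# checkpointing wrapper) can inject additional kwargs without triggering a
# TypeError about unexpected keyword arguments.
class DecoderLayerSketch(torch.nn.Module):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_ids=None,
        past_key_value=None,
        output_attentions=False,
        use_cache=False,
        **kwargs,  # extra kwargs passed by wrappers are accepted and ignored
    ):
        # ... attention / MLP computation would go here ...
        return hidden_states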

cc @amyeroberts

@younesbelkada younesbelkada requested a review from LysandreJik May 31, 2024 10:37
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Contributor) left a comment

Thanks for fixing and apologies for breaking this!

Some questions before we can merge

  • Would it make sense to add a test to make sure we don't accidentally break this again?
  • Having **kwargs in the forward method isn't standard amongst transformers models. Is there something special about these models which need this for FSDP? If not, should we be adding to other models?
  • Is there an alternative to using this injection? Relying on kwargs being passed isn't ideal

@younesbelkada (Contributor, Author)

Thanks!

> Would it make sense to add a test to make sure we don't accidentally break this again?

Yes, I'll add a test in this PR to test this behavior and catch bugs in the future!
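
A rough sketch of what such a regression test could look like (the test name and structure here are hypothetical, not the exact test added in this PR). It mirrors the reproduction snippet above; to exercise the real failing path it would still need to run under `accelerate launch` with an FSDP config:

from functools import partial

import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def test_llama_activation_checkpointing_forward():
    model = AutoModelForCausalLM.from_pretrained(
        "HuggingFaceM4/tiny-random-Llama3ForCausalLM"
    )
    model.train()
    model.gradient_checkpointing_enable()

    non_reentrant_wrapper = partial(
        checkpoint_wrapper,
        offload_to_cpu=False,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda submodule: isinstance(submodule, LlamaDecoderLayer),
    )

    # Before the fix, the wrapped decoder layers could raise a TypeError about
    # unexpected keyword arguments; the forward pass should now succeed.
    model(torch.LongTensor([[0, 1, 0, 1]]))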

> Having **kwargs in the forward method isn't standard amongst transformers models. Is there something special about these models which need this for FSDP? If not, should we be adding to other models?

Yes, agreed - I think we should add it to all 'most-used' models. FSDP is useful for large models, so I would say we should add it for LLMs (llama, gemma, mistral, mixtral, gpt-neo, etc.) to make things consistent. Happy to do that within this PR!

> Is there an alternative to using this injection? Relying on kwargs being passed isn't ideal

I am not sure; this seems to be something internal to FSDP + CPU offloading, and I don't think we can find a workaround for it :/ Since it used to work before, I think it should keep working in future transformers versions to ensure backward compatibility. What do you think?

@amyeroberts (Contributor)

> Yes, I'll add a test in this PR to test this behavior and catch bugs in the future!
>
> Yes, agreed - I think we should add it to all 'most-used' models. FSDP is useful for large models, so I would say we should add it for LLMs (llama, gemma, mistral, mixtral, gpt-neo, etc.) to make things consistent. Happy to do that within this PR!

Awesome - thank you!

> I am not sure; this seems to be something internal to FSDP + CPU offloading, and I don't think we can find a workaround for it :/ Since it used to work before, I think it should keep working in future transformers versions to ensure backward compatibility. What do you think?

Makes sense - let's leave as-is :)

@amyeroberts (Contributor) commented Jun 26, 2024

@younesbelkada I'm really sorry I missed the re-request for review. I don't have permissions to make changes to this branch, so I copied it here: #31638 and synced it with main. I couldn't push while working locally, but I could make changes through the web editor.

@amyeroberts (Contributor) left a comment

Thanks for fixing @younesbelkada, and apologies for the delay in reviewing.

I was able to make the necessary updates to resolve conflicts with main through the online editor. As this just involved merging in the new input argument, it didn't affect the structure of the PR. I did remove the testing_utils scripts (which I would have asked you to remove in a review :) )

@amyeroberts amyeroberts merged commit 3f93fd0 into main Jun 26, 2024
@amyeroberts amyeroberts deleted the fix-llama-fsdp branch June 26, 2024 13:50
@muellerzr muellerzr mentioned this pull request Mar 6, 2025
Successfully merging this pull request may close these issues.

Llama Attention Call should not pass **kwargs
