Conversation

@zucchini-nlp (Member) commented Oct 17, 2025

What does this PR do?

Deletes deprecated code for the v5 release. cc @yonigozlan and @eustlb to verify that we can really delete these, since it is mostly deprecated stuff from processors.

For core maintainers:

  1. The `max_size` and pad-pixels args can't be deleted because many models on the Hub still use them. Instead, I decided to stop raising a warning and to silently load `max_size` when present. However, users can no longer pass these args when calling the processor (see the sketch after this list).
  2. For SDPA attention weights, the warning is reworded to push us to update those models. Usually, if a model has the new attention API, the warning is raised internally in `sdpa_attention`, but for old models we have to manually duplicate the warning message.
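
For illustration, a minimal sketch of the behavior described in point 1; the class and attribute names here are hypothetical, not the actual processor code:

```python
# Hypothetical sketch of the new `max_size` handling: configs saved on the
# Hub with a legacy `max_size` still load silently, but passing the arg at
# call time is rejected.
from typing import Optional


class ExampleImageProcessor:
    def __init__(self, size: Optional[dict] = None, **kwargs):
        # Fold a legacy `max_size` from a saved config into `size` without
        # emitting a deprecation warning.
        max_size = kwargs.pop("max_size", None)
        if max_size is not None:
            size = {**(size or {}), "longest_edge": max_size}
        self.size = size

    def __call__(self, images, **kwargs):
        # Call-time usage of the removed arg is no longer supported at all.
        if "max_size" in kwargs:
            raise ValueError(
                "`max_size` cannot be passed when calling the processor; set it in `size` instead."
            )
        return images  # placeholder for the real preprocessing
```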

@yonigozlan (Member) left a comment

Love this so much 🥲, thanks for taking the time to do this @zucchini-nlp!
My only reservation is `max_size`; I'll try to scan the Hub to see how bad the damage would be right now, and open PRs where necessary.

@eustlb (Contributor) left a comment

Thanks a lot for taking care of that!
General comment:

For every model that had something like

```python
logger.warning_once(
    "MimiModel is using MimiSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
    'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
```

because it was copied from Gemma, I think we should keep a warning. I've seen your answer above about that, but doesn't it only apply when `output_attentions` is set in the config, not in the forward? Therefore, until those implementations rely on ALL_ATTENTION_FUNCTIONS, we still have to handle the warning here to ensure we don't silently skip outputting attentions (see the sketch below).
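
As a sketch, the manual fallback in question looks roughly like this; the class and helper-method names are illustrative, not the actual Mimi code:

```python
# Illustrative sketch: SDPA cannot return attention weights, so the module
# itself has to warn and fall back to the eager implementation, instead of
# silently returning `None` for the weights.
from transformers.utils import logging

logger = logging.get_logger(__name__)


class ExampleSdpaAttention:
    def forward(self, hidden_states, attention_mask=None, output_attentions=False):
        if output_attentions:
            logger.warning_once(
                "`torch.nn.functional.scaled_dot_product_attention` does not support "
                "`output_attentions=True`. Falling back to the manual attention implementation."
            )
            return self._eager_forward(hidden_states, attention_mask, output_attentions)
        return self._sdpa_forward(hidden_states, attention_mask)

    def _eager_forward(self, hidden_states, attention_mask, output_attentions):
        ...  # manual attention that can materialize the weights

    def _sdpa_forward(self, hidden_states, attention_mask):
        ...  # fast path via torch.nn.functional.scaled_dot_product_attention
```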

Review from my side:
core models (meaning ones that do not rely on modular):

  • moshi
  • mimi
  • clap
  • seamless_m4t
  • unispeech
  • unispeech_sat
  • wav2vec2
  • wav2vec2_bert
  • whisper

the rest are impacted by modular directly:

  • hubert (wav2vec2)
  • kyutai_speech_to_text (moshi)
  • sew (wav2vec2)
  • sew_d (wav2vec2)
  • wavlm (wav2vec2)
  • data2vec_audio (wav2vec2)

@zucchini-nlp (Member, Author) commented

> Therefore, until those implementations rely on ALL_ATTENTION_FUNCTIONS

Ah, after looking, I see that it's not going to raise any warning unless we use ALL_ATTENTION_FUNCTIONS. In that case, I think we can reword the warning to say that attention weights will not be returned, and remove the workaround. Users have been warned about SDPA, so I think it will not be a big problem (a sketch of the reworded message is below).

It will also nudge us to incorporate ALL_ATTENTION_FUNCTIONS. WDYT?
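
For concreteness, the rewording could look something like this; the exact message is a proposal, not merged text:

```python
# Proposed shape of the reworded warning: no eager fallback, just an
# explicit note that the weights are skipped.
from transformers.utils import logging

logger = logging.get_logger(__name__)

logger.warning_once(
    "`torch.nn.functional.scaled_dot_product_attention` does not support "
    "`output_attentions=True`. Attention weights will not be returned; load the "
    'model with `attn_implementation="eager"` to get them.'
)
```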

@eustlb (Contributor) commented Oct 17, 2025

Yep, totally agree!

```python
self,
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.FloatTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
```
@zucchini-nlp (Member, Author) commented on this diff:

The model doesn't use positions; I believe this was simply copied from other models without checking. The unused argument can simply be dropped, as sketched below.
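
A sketch of the trimmed signature (a fragment for illustration, not the exact diff):

```python
# Sketch: drop the unused `position_ids` argument from the signature.
from typing import Optional

import torch


def forward(
    self,
    input_ids: Optional[torch.LongTensor] = None,
    attention_mask: Optional[torch.FloatTensor] = None,
):
    ...
```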

@zucchini-nlp (Member, Author) commented

Hub down? 😢

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, altclip, beit, blip_2, bloom, bridgetower, clap, conditional_detr, data2vec, deepseek_v2, deformable_detr, detr, grounding_dino

@zucchini-nlp (Member, Author) commented

OK, I think it is done now. @yonigozlan @eustlb, feel free to take another look; I will also request a core maintainer's review.
