[chat templates} support loading audio from video #36955

zucchini-nlp · 2025-03-25T10:19:24Z

What does this PR do?

If user indicates load_audio_from_video, we will extract audio part of the video and use it as input audio. Most any-to-text models are used this way, according to their demo inference scripts

Needed for Qwen-Omni release

github-actions · 2025-03-25T10:19:40Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers.

zucchini-nlp · 2025-03-25T10:22:27Z

@Rocketknight1 can you take a look, it can be supported with minimal changes. Unfortunately the tests won't run, I see Phi4 merged but seems like it has not chat template. I will ask Cyril and run tests on Phi4 if it has templates

HuggingFaceDocBuilderDev · 2025-03-25T10:44:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp · 2025-03-25T18:51:15Z

Oke, tested that the template is applied correctly and the inputs are passed to the model with a dummy qwen2-vl processor. Working and ready for review!

Rocketknight1

Overall this makes sense to me, and I definitely think it's good to support audio/video/audio-from-video in chat templates, similar to the interface provided by a lot of commercial APIs!

Left a couple of nits/comments, mostly centred around the TypedDict classes used for kwargs

src/transformers/processing_utils.py

Rocketknight1 · 2025-03-26T19:49:57Z

src/transformers/processing_utils.py

+        mm_load_kwargs = {}
+        for mm_load_key in ChatTemplateLoadKwargs.__annotations__.keys():
+            default_value = getattr(ChatTemplateLoadKwargs, mm_load_key, None)
+            value = kwargs.pop(mm_load_key, default_value)
+            mm_load_kwargs[mm_load_key] = value


Not really related to this PR, but this code to match kwargs with the TypedDicts feels quite long and confusing, especially if it's used multiple times. Is there some cleaner way to populate a dict with default values that can be overridden by kwargs - maybe use custom classes/dataclasses instead, or add a helper method to the TypedDicts that gets inherited?

yeah, this is getting out of hand. I am planning to refactor this and the new video loading a bit in subsequent PRs. In general it looks now we are over-abusing TypedDict for what it usually is not used, so I will consider doing something else

* add audio from video * typos * delete print * comments

add audio from video

5239205

github-actions bot marked this pull request as draft March 25, 2025 10:19

zucchini-nlp marked this pull request as ready for review March 25, 2025 10:19

zucchini-nlp requested a review from Rocketknight1 March 25, 2025 10:19

zucchini-nlp added 2 commits March 25, 2025 19:48

typos

1592d8b

delete print

a0afe48

Rocketknight1 approved these changes Mar 26, 2025

View reviewed changes

comments

defbcf0

zucchini-nlp mentioned this pull request Mar 27, 2025

Add Qwen2.5-Omni #36752

Merged

5 tasks

Merge branch 'main' into audio-video-chat-templates

53393c8

zucchini-nlp merged commit e97c760 into huggingface:main Mar 27, 2025
20 checks passed

zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025

[chat templates} support loading audio from video (huggingface#36955)

4f01dbb

* add audio from video * typos * delete print * comments

soghomon-b pushed a commit to soghomon-b/transformers that referenced this pull request Aug 24, 2025

[chat templates} support loading audio from video (huggingface#36955)

ee8c5aa

* add audio from video * typos * delete print * comments

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[chat templates} support loading audio from video #36955

[chat templates} support loading audio from video #36955

Uh oh!

zucchini-nlp commented Mar 25, 2025

Uh oh!

github-actions bot commented Mar 25, 2025

Uh oh!

zucchini-nlp commented Mar 25, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Mar 25, 2025

Uh oh!

zucchini-nlp commented Mar 25, 2025

Uh oh!

Rocketknight1 left a comment

Uh oh!

Uh oh!

Uh oh!

Rocketknight1 Mar 26, 2025

Uh oh!

zucchini-nlp Mar 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[chat templates} support loading audio from video #36955

[chat templates} support loading audio from video #36955

Uh oh!

Conversation

zucchini-nlp commented Mar 25, 2025

What does this PR do?

Uh oh!

github-actions bot commented Mar 25, 2025

Uh oh!

zucchini-nlp commented Mar 25, 2025

Uh oh!

HuggingFaceDocBuilderDev commented Mar 25, 2025

Uh oh!

zucchini-nlp commented Mar 25, 2025

Uh oh!

Rocketknight1 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Rocketknight1 Mar 26, 2025

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants