Skip to content

Conversation

zucchini-nlp
Copy link
Member

What does this PR do?

If user indicates load_audio_from_video, we will extract audio part of the video and use it as input audio. Most any-to-text models are used this way, according to their demo inference scripts

Needed for Qwen-Omni release

Copy link
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers.

@github-actions github-actions bot marked this pull request as draft March 25, 2025 10:19
@zucchini-nlp zucchini-nlp marked this pull request as ready for review March 25, 2025 10:19
@zucchini-nlp
Copy link
Member Author

@Rocketknight1 can you take a look, it can be supported with minimal changes. Unfortunately the tests won't run, I see Phi4 merged but seems like it has not chat template. I will ask Cyril and run tests on Phi4 if it has templates

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Copy link
Member Author

Oke, tested that the template is applied correctly and the inputs are passed to the model with a dummy qwen2-vl processor. Working and ready for review!

Copy link
Member

@Rocketknight1 Rocketknight1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this makes sense to me, and I definitely think it's good to support audio/video/audio-from-video in chat templates, similar to the interface provided by a lot of commercial APIs!

Left a couple of nits/comments, mostly centred around the TypedDict classes used for kwargs

Comment on lines +1330 to +1334
mm_load_kwargs = {}
for mm_load_key in ChatTemplateLoadKwargs.__annotations__.keys():
default_value = getattr(ChatTemplateLoadKwargs, mm_load_key, None)
value = kwargs.pop(mm_load_key, default_value)
mm_load_kwargs[mm_load_key] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really related to this PR, but this code to match kwargs with the TypedDicts feels quite long and confusing, especially if it's used multiple times. Is there some cleaner way to populate a dict with default values that can be overridden by kwargs - maybe use custom classes/dataclasses instead, or add a helper method to the TypedDicts that gets inherited?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is getting out of hand. I am planning to refactor this and the new video loading a bit in subsequent PRs. In general it looks now we are over-abusing TypedDict for what it usually is not used, so I will consider doing something else

@zucchini-nlp zucchini-nlp mentioned this pull request Mar 27, 2025
5 tasks
@zucchini-nlp zucchini-nlp merged commit e97c760 into huggingface:main Mar 27, 2025
20 checks passed
zucchini-nlp added a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
* add audio from video

* typos

* delete print

* comments
soghomon-b pushed a commit to soghomon-b/transformers that referenced this pull request Aug 24, 2025
* add audio from video

* typos

* delete print

* comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants