- 
                Notifications
    You must be signed in to change notification settings 
- Fork 30.9k
[chat templates} support loading audio from video #36955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[chat templates} support loading audio from video #36955
Conversation
| Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the  | 
| @Rocketknight1 can you take a look, it can be supported with minimal changes. Unfortunately the tests won't run, I see Phi4 merged but seems like it has not chat template. I will ask Cyril and run tests on Phi4 if it has templates | 
| The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. | 
| Oke, tested that the template is applied correctly and the inputs are passed to the model with a dummy qwen2-vl processor. Working and ready for review! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this makes sense to me, and I definitely think it's good to support audio/video/audio-from-video in chat templates, similar to the interface provided by a lot of commercial APIs!
Left a couple of nits/comments, mostly centred around the TypedDict classes used for kwargs
| mm_load_kwargs = {} | ||
| for mm_load_key in ChatTemplateLoadKwargs.__annotations__.keys(): | ||
| default_value = getattr(ChatTemplateLoadKwargs, mm_load_key, None) | ||
| value = kwargs.pop(mm_load_key, default_value) | ||
| mm_load_kwargs[mm_load_key] = value | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really related to this PR, but this code to match kwargs with the TypedDicts feels quite long and confusing, especially if it's used multiple times. Is there some cleaner way to populate a dict with default values that can be overridden by kwargs - maybe use custom classes/dataclasses instead, or add a helper method to the TypedDicts that gets inherited?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is getting out of hand. I am planning to refactor this and the new video loading a bit in subsequent PRs. In general it looks now we are over-abusing TypedDict for what it usually is not used, so I will consider doing something else
* add audio from video * typos * delete print * comments
* add audio from video * typos * delete print * comments
What does this PR do?
If user indicates
load_audio_from_video, we will extract audio part of the video and use it as input audio. Most any-to-text models are used this way, according to their demo inference scriptsNeeded for Qwen-Omni release