Integrations/wan2.2 s2v #1
…date example imports Add unit tests for WanSpeechToVideoPipeline and WanS2VTransformer3DModel and gguf
The previous audio encoding logic was a placeholder. It is now replaced with a `Wav2Vec2ForCTC` model and processor, including the full implementation for processing audio inputs. This involves resampling and aligning audio features with video frames to ensure proper synchronization. Additionally, utility functions for loading audio from files or URLs are added, and the `audio_processor` module is refactored to correctly handle audio data types instead of image types.
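
As a rough illustration of the alignment step described above, the sketch below linearly resamples a sequence of audio feature vectors to one vector per video frame. It is not the PR's implementation: the function name `align_features_to_frames` and the choice of linear interpolation are assumptions made for illustration, and it works on plain Python lists rather than tensors.

```python
from typing import List, Sequence


def align_features_to_frames(
    features: Sequence[Sequence[float]], num_frames: int
) -> List[List[float]]:
    """Linearly resample time-major `features` to exactly `num_frames` steps.

    Hypothetical helper: the name and the interpolation scheme are assumptions.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not features:
        raise ValueError("features must not be empty")

    last = len(features) - 1
    aligned = []
    for frame in range(num_frames):
        # Position of this video frame on the audio-feature time axis.
        pos = 0.0 if num_frames == 1 else frame * last / (num_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        weight = pos - lo
        # Blend the two neighbouring feature vectors.
        aligned.append(
            [(1.0 - weight) * a + weight * b for a, b in zip(features[lo], features[hi])]
        )
    return aligned
```

Each output row corresponds to one video frame, so the result can be indexed by frame number regardless of how many feature steps the audio encoder produced.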
Introduces support for audio and pose conditioning, replacing the previous image conditioning mechanism. The model now accepts audio embeddings and pose latents as input. This change also adds two new, mutually exclusive motion processing modules:

- `MotionerTransformers`: A transformer-based module for encoding motion.
- `FramePackMotioner`: A module that packs frames from different temporal buckets for motion representation.

Additionally, an `AudioInjector` module is implemented to fuse audio features into specific transformer blocks using cross-attention.
The `MotionerTransformers` module is removed and its functionality is replaced by a `FramePackMotioner` module and a simplified standard motion processing pipeline. The codebase is refactored to remove the `einops` dependency, replacing `rearrange` operations with standard PyTorch tensor manipulations for better code consistency. Additionally, `AdaLayerNorm` is introduced for improved conditioning, and helper functions for Rotary Positional Embeddings (RoPE) are added (probably temporarily) and refactored for clarity and flexibility. The audio injection mechanism is also updated to align with the new model structure.
Removes the calculation of several unused variables and an unnecessary `deepcopy` operation on the latents tensor. This change also removes the now-unused `deepcopy` import, simplifying the overall logic.
Refactors the `WanS2VTransformer3DModel` for clarity and better handling of various conditioning inputs like audio, pose, and motion. Key changes:

- Simplifies the `WanS2VTransformerBlock` by removing projection layers and streamlining the forward pass.
- Introduces `after_transformer_block` to cleanly inject audio information after each transformer block, improving code organization.
- Enhances the main `forward` method to better process and combine multiple conditioning signals (image, audio, motion) before the transformer blocks.
- Adds support for a zero-value timestep to differentiate between image and video latents.
- Generalizes temporal embedding logic to support multiple model variations.
Introduces the necessary configurations and state dictionary key mappings to enable the conversion of S2V model checkpoints to the Diffusers format. This includes:

- A new transformer configuration for the S2V model architecture, including parameters for audio and pose conditioning.
- A comprehensive rename dictionary to map the original S2V layer names to their Diffusers equivalents.
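
A minimal sketch of how a rename dictionary of this kind can be applied to checkpoint keys is shown below. The entries in `EXAMPLE_RENAME_DICT` and the helper name `rename_state_dict_keys` are hypothetical and chosen only for illustration; the PR's actual mapping is not reproduced here.

```python
from typing import Any, Dict

# Hypothetical entries for illustration only; not the PR's real mapping.
EXAMPLE_RENAME_DICT = {
    "self_attn.q": "attn1.to_q",
    "ffn.0": "ffn.net.0.proj",
}


def rename_state_dict_keys(
    state_dict: Dict[str, Any], rename_dict: Dict[str, str]
) -> Dict[str, Any]:
    """Return a new dict whose keys have every matching substring replaced."""
    converted: Dict[str, Any] = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in rename_dict.items():
            new_key = new_key.replace(old, new)
        # Guard against two source keys collapsing onto the same target key.
        if new_key in converted:
            raise KeyError(f"Two source keys map to {new_key!r}")
        converted[new_key] = value
    return converted
```

Keys that match no entry pass through unchanged, so layers that already use the target naming need no special handling.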
…heads in transformer configuration
- Updated device references in audio encoding and pose video loading to use a unified device variable.
- Enhanced image preprocessing to include a resize mode option for better handling of input dimensions.

Co-authored-by: Ju Hoon Park <[email protected]>
Added contributor information and enhanced model description.
Added project page link for Wan-S2V model and improved context.
Pull Request Overview
This PR adds support for Wan2.2-S2V-14B, a speech-to-video generation model that creates cinematic videos from audio, text prompts, and reference images. The implementation includes a new transformer architecture, pipeline, audio processing utilities, and video/audio merging capabilities.
Key changes:
- New `WanS2VTransformer3DModel` for audio-driven video generation with frame-packing and audio injection
- New `WanSpeechToVideoPipeline` for end-to-end speech-to-video generation
- Audio processing infrastructure and video/audio merging utilities
Reviewed Changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/diffusers/models/transformers/transformer_wan_s2v.py | Implements the S2V transformer with audio encoding, motion processing, and frame-packing |
| src/diffusers/pipelines/wan/pipeline_wan_s2v.py | Adds the speech-to-video pipeline with audio and pose conditioning |
| src/diffusers/audio_processor.py | Defines audio input types and validation utilities |
| src/diffusers/utils/loading_utils.py | Adds audio loading and video frame sampling utilities |
| src/diffusers/utils/export_utils.py | Implements video/audio merging via FFmpeg |
| src/diffusers/video_processor.py | Extends video preprocessing with new resize modes |
| src/diffusers/image_processor.py | Adds resize_min_center_crop mode for S2V preprocessing |
| scripts/convert_wan_to_diffusers.py | Adds S2V model conversion support |
| tests/pipelines/wan/test_wan_speech_to_video.py | Test suite for the S2V pipeline |
| tests/quantization/gguf/test_gguf.py | GGUF quantization tests for S2V |
| docs/source/en/api/pipelines/wan.md | Documentation for the S2V pipeline |
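
For readers unfamiliar with merging video and audio through FFmpeg, the sketch below builds and runs a typical command. The function names `build_merge_command` and `merge_video_audio` are hypothetical and are not taken from `export_utils.py`; only the FFmpeg options themselves are standard.

```python
import shutil
import subprocess
from typing import List


def build_merge_command(video_path: str, audio_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg command that muxes one video stream with one audio stream."""
    return [
        "ffmpeg",
        "-y",                # overwrite the output file if it exists
        "-i", video_path,    # input 0: video
        "-i", audio_path,    # input 1: audio
        "-map", "0:v:0",     # take the first video stream of input 0
        "-map", "1:a:0",     # take the first audio stream of input 1
        "-c:v", "copy",      # keep the video stream without re-encoding
        "-c:a", "aac",       # encode the audio as AAC
        "-shortest",         # stop when the shorter stream ends
        output_path,
    ]


def merge_video_audio(video_path: str, audio_path: str, output_path: str) -> None:
    """Run ffmpeg; raises if the executable is missing or the command fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg executable not found on PATH")
    subprocess.run(build_merge_command(video_path, audio_path, output_path), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.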
Comments suppressed due to low confidence (1)
src/diffusers/models/transformers/transformer_wan_s2v.py:1
- Corrected spelling of 'indice' to 'index'.
# Copyright 2025 The Wan Team and The HuggingFace Team. All rights reserved.

    if crop_type == "paste_center":
        # Paste on canvas, center position
        res = Image.new("RGB", (width, height), color=0)  # Black background

Copilot AI, Nov 10, 2025
Variables `src_w` and `src_h` are only defined in the `fit_within` branch but are used in the `paste_center` crop type regardless of resize type. This will cause a `NameError` when `resize_type='min_dimension'` and `crop_type='paste_center'`.
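
The failure mode can be reproduced in isolation. The two functions below are hypothetical reductions, not the PR's code: the first binds `src_w` and `src_h` in only one branch, and the second shows one possible fix, binding them before any branching.

```python
from typing import Tuple


def center_offset_buggy(
    size: Tuple[int, int], canvas: Tuple[int, int], resize_type: str
) -> Tuple[int, int]:
    width, height = canvas
    if resize_type == "fit_within":
        src_w, src_h = size
    # For any other resize_type the names below are unbound, which raises
    # UnboundLocalError (a subclass of NameError).
    return (width - src_w) // 2, (height - src_h) // 2


def center_offset_fixed(
    size: Tuple[int, int], canvas: Tuple[int, int], resize_type: str
) -> Tuple[int, int]:
    width, height = canvas
    # Bind the source size up front so every later branch can rely on it.
    src_w, src_h = size
    if resize_type not in ("fit_within", "min_dimension"):
        raise ValueError(f"Unknown resize_type: {resize_type!r}")
    return (width - src_w) // 2, (height - src_h) // 2
```

Calling `center_offset_buggy` with `resize_type="min_dimension"` raises, while `center_offset_fixed` returns an offset for both resize types.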

                f"Incorrect path or URL. URLs must start with `http://` or `https://`, and {audio} is not a valid path."
            )
    elif isinstance(audio, numpy.ndarray):
        audio = audio

Copilot AI, Nov 10, 2025
This line assigns audio to itself, which is redundant and serves no purpose. Consider removing it.
Suggested change:

    - audio = audio

    indice = start_frame + i * interval
    if indice >= total_frames:
        break
    sampled_indices.append(int(indice))

Copilot AI, Nov 10, 2025
Corrected spelling of 'indice' to 'index'.
Suggested change:

    - indice = start_frame + i * interval
    - if indice >= total_frames:
    -     break
    - sampled_indices.append(int(indice))
    + index = start_frame + i * interval
    + if index >= total_frames:
    +     break
    + sampled_indices.append(int(index))

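
For context, the loop in this suggestion can be wrapped in a self-contained function as sketched below. The function name `sample_frame_indices`, its signature, and the `num_samples` parameter are assumptions made for illustration; only the loop body comes from the snippet under review.

```python
from typing import List


def sample_frame_indices(
    total_frames: int, start_frame: int, interval: float, num_samples: int
) -> List[int]:
    """Pick up to `num_samples` frame indices spaced `interval` apart.

    Hypothetical wrapper around the reviewed loop; stops early once the
    next index would fall outside the video.
    """
    sampled_indices: List[int] = []
    for i in range(num_samples):
        index = start_frame + i * interval
        if index >= total_frames:
            break
        # A fractional interval is truncated to a whole frame index.
        sampled_indices.append(int(index))
    return sampled_indices
```

Because the loop breaks at the first out-of-range index, the returned list can be shorter than `num_samples` for short videos, so callers should not assume a fixed length.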
What does this PR do?
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.