@zecloud zecloud commented Nov 10, 2025

What does this PR do?

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

…date example imports

Add unit tests for WanSpeechToVideoPipeline and WanS2VTransformer3DModel, plus GGUF quantization tests
The previous audio encoding logic was a placeholder. It is now replaced with a `Wav2Vec2ForCTC` model and processor, including the full implementation for processing audio inputs. This involves resampling and aligning audio features with video frames to ensure proper synchronization.
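For orientation, here is a minimal sketch of this kind of encoding step, assuming a generic Wav2Vec2 checkpoint and a simple interpolation-based alignment (the checkpoint ID, function name, and alignment strategy are illustrative, not taken from the PR):

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative checkpoint; the pipeline may ship with a different one.
model_id = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

def encode_audio(waveform: torch.Tensor, sample_rate: int, num_frames: int) -> torch.Tensor:
    """Resample to 16 kHz, run Wav2Vec2, and align features to `num_frames` video frames."""
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values, output_hidden_states=True).hidden_states[-1]
    # (1, T_audio, C) -> (1, num_frames, C): interpolate over time so each video frame gets a feature.
    hidden = hidden.transpose(1, 2)
    aligned = torch.nn.functional.interpolate(hidden, size=num_frames, mode="linear", align_corners=False)
    return aligned.transpose(1, 2)
```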

Additionally, utility functions for loading audio from files or URLs are added, and the `audio_processor` module is refactored to correctly handle audio data types instead of image types.
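A sketch of what such a loader might look like; the function name, the `librosa` dependency, and the default sampling rate are assumptions for illustration:

```python
import io

import librosa
import numpy as np
import requests

def load_audio(audio: str, sampling_rate: int = 16000) -> np.ndarray:
    """Load audio from a local path or an http(s) URL into a mono float array."""
    if audio.startswith(("http://", "https://")):
        data = io.BytesIO(requests.get(audio, timeout=30).content)
        waveform, _ = librosa.load(data, sr=sampling_rate, mono=True)
    else:
        waveform, _ = librosa.load(audio, sr=sampling_rate, mono=True)
    return waveform
```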
Introduces support for audio and pose conditioning, replacing the previous image conditioning mechanism. The model now accepts audio embeddings and pose latents as input.

This change also adds two new, mutually exclusive motion processing modules:
- `MotionerTransformers`: A transformer-based module for encoding motion.
- `FramePackMotioner`: A module that packs frames from different temporal buckets for motion representation.

Additionally, an `AudioInjector` module is implemented to fuse audio features into specific transformer blocks using cross-attention.
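As a rough illustration of the cross-attention fusion idea (module layout and argument names are guesses, not the PR's actual `AudioInjector`):

```python
import torch
import torch.nn as nn

class AudioInjector(nn.Module):
    """Fuses audio features into selected transformer blocks via cross-attention."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden_states: torch.Tensor, audio_embeds: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, N, dim); audio_embeds: (B, T_audio, audio_dim)
        audio = self.audio_proj(audio_embeds)
        attn_out, _ = self.cross_attn(self.norm(hidden_states), audio, audio)
        return hidden_states + attn_out
```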
The `MotionerTransformers` module is removed and its functionality is replaced by a `FramePackMotioner` module and a simplified standard motion processing pipeline.

The codebase is refactored to remove the `einops` dependency, replacing `rearrange` operations with standard PyTorch tensor manipulations for better code consistency.
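As an example of the kind of substitution involved, an `einops` pattern such as `rearrange(x, "b c (f p) h w -> b f (c p) h w", p=2)` maps onto plain `reshape`/`permute` calls (the shape below is illustrative):

```python
import torch

# Equivalent of: rearrange(x, "b c (f p) h w -> b f (c p) h w", p=2)
x = torch.randn(1, 16, 8, 32, 32)   # (b, c, f*p, h, w)
b, c, fp, h, w = x.shape
p = 2
f = fp // p
x = x.reshape(b, c, f, p, h, w)      # split the frame axis into (f, p)
x = x.permute(0, 2, 1, 3, 4, 5)      # -> (b, f, c, p, h, w)
x = x.reshape(b, f, c * p, h, w)     # merge channels with the packing factor
```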

Additionally, `AdaLayerNorm` is introduced for improved conditioning, and helper functions for Rotary Positional Embeddings (RoPE) are added (probably temporarily) and refactored for clarity and flexibility. The audio injection mechanism is also updated to align with the new model structure.
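A generic sketch of a RoPE helper pair of the kind referred to here, using the complex-multiplication formulation (not the exact helpers added in the PR; the channel-pairing convention may differ):

```python
import torch

def rope_frequencies(dim: int, positions: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Complex rotary frequencies of shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions.float(), inv_freq)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x (..., seq, dim) by the given frequencies."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = torch.view_as_real(x_complex * freqs).flatten(-2)
    return x_rotated.type_as(x)
```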
Removes the calculation of several unused variables and an unnecessary `deepcopy` operation on the latents tensor.

This change also removes the now-unused `deepcopy` import, simplifying the overall logic.
Refactors the `WanS2VTransformer3DModel` for clarity and better handling of various conditioning inputs like audio, pose, and motion.

Key changes:
- Simplifies the `WanS2VTransformerBlock` by removing projection layers and streamlining the forward pass.
- Introduces `after_transformer_block` to cleanly inject audio information after each transformer block, improving code organization (a sketch follows this list).
- Enhances the main `forward` method to better process and combine multiple conditioning signals (image, audio, motion) before the transformer blocks.
- Adds support for a zero-value timestep to differentiate between image and video latents.
- Generalizes temporal embedding logic to support multiple model variations.
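A toy, self-contained sketch of the "inject audio after each block" pattern (everything except the `after_transformer_block` name is illustrative):

```python
import torch
import torch.nn as nn

class TinyS2VTransformer(nn.Module):
    """Toy model illustrating per-block audio injection via a post-block hook."""

    def __init__(self, dim: int = 64, num_blocks: int = 2, num_heads: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(num_blocks)
        )
        self.audio_cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_blocks)
        )

    def after_transformer_block(self, idx: int, hidden_states: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # Audio is fused here, after the block, so the block itself stays audio-agnostic.
        attn_out, _ = self.audio_cross_attn[idx](hidden_states, audio, audio)
        return hidden_states + attn_out

    def forward(self, hidden_states: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        for idx, block in enumerate(self.blocks):
            hidden_states = block(hidden_states)
            hidden_states = self.after_transformer_block(idx, hidden_states, audio)
        return hidden_states
```

Keeping the fusion in a hook like this leaves the base block reusable for Wan variants that have no audio conditioning.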
Introduces the necessary configurations and state dictionary key mappings to enable the conversion of S2V model checkpoints to the Diffusers format.

This includes:
- A new transformer configuration for the S2V model architecture, including parameters for audio and pose conditioning.
- A comprehensive rename dictionary to map the original S2V layer names to their Diffusers equivalents (see the sketch below).
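A hypothetical illustration of such a mapping and how it is applied during conversion (the keys shown are made up; the real dictionary lives in `scripts/convert_wan_to_diffusers.py`):

```python
# Hypothetical examples only; see scripts/convert_wan_to_diffusers.py for the real mapping.
S2V_RENAME_DICT = {
    "audio_proj.": "condition_embedder.audio_embedder.",
    "pose_patch_embedding.": "condition_embedder.pose_embedder.",
    "time_embedding.0.": "condition_embedder.time_embedder.linear_1.",
}

def convert_state_dict(state_dict: dict) -> dict:
    """Rename original S2V checkpoint keys to their Diffusers equivalents."""
    converted = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in S2V_RENAME_DICT.items():
            new_key = new_key.replace(old, new)
        converted[new_key] = value
    return converted
```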
tolgacangoz and others added 26 commits September 24, 2025 18:35
- Updated device references in audio encoding and pose video loading to use a unified device variable.
- Enhanced image preprocessing to include a resize mode option for better handling of input dimensions.

Co-authored-by: Ju Hoon Park <[email protected]>
Added contributor information and enhanced model description.
Added project page link for Wan-S2V model and improved context.
Copilot AI review requested due to automatic review settings November 10, 2025 16:10

Copilot AI left a comment


Pull Request Overview

This PR adds support for Wan2.2-S2V-14B, a speech-to-video generation model that creates cinematic videos from audio, text prompts, and reference images. The implementation includes a new transformer architecture, pipeline, audio processing utilities, and video/audio merging capabilities.

Key changes:

  • New WanS2VTransformer3DModel for audio-driven video generation with frame-packing and audio injection
  • New WanSpeechToVideoPipeline for end-to-end speech-to-video generation
  • Audio processing infrastructure and video/audio merging utilities
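For context, a minimal usage sketch of the new pipeline; the checkpoint ID, the `load_audio` return signature, and the call arguments are assumptions and may differ from the merged API:

```python
import torch
from diffusers import WanSpeechToVideoPipeline
from diffusers.utils import export_to_video, load_audio, load_image

# Checkpoint ID and argument names are guesses for illustration.
pipe = WanSpeechToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.2-S2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("reference.png")
audio, sampling_rate = load_audio("speech.wav")  # load_audio is added by this PR; signature assumed

video = pipe(
    prompt="A person speaking expressively to the camera",
    image=image,
    audio=audio,
    sampling_rate=sampling_rate,
    num_frames=81,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```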

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/diffusers/models/transformers/transformer_wan_s2v.py | Implements the S2V transformer with audio encoding, motion processing, and frame-packing |
| src/diffusers/pipelines/wan/pipeline_wan_s2v.py | Adds the speech-to-video pipeline with audio and pose conditioning |
| src/diffusers/audio_processor.py | Defines audio input types and validation utilities |
| src/diffusers/utils/loading_utils.py | Adds audio loading and video frame sampling utilities |
| src/diffusers/utils/export_utils.py | Implements video/audio merging via FFmpeg |
| src/diffusers/video_processor.py | Extends video preprocessing with new resize modes |
| src/diffusers/image_processor.py | Adds resize_min_center_crop mode for S2V preprocessing |
| scripts/convert_wan_to_diffusers.py | Adds S2V model conversion support |
| tests/pipelines/wan/test_wan_speech_to_video.py | Test suite for the S2V pipeline |
| tests/quantization/gguf/test_gguf.py | GGUF quantization tests for S2V |
| docs/source/en/api/pipelines/wan.md | Documentation for the S2V pipeline |
Comments suppressed due to low confidence (1)

src/diffusers/models/transformers/transformer_wan_s2v.py:1

  • Corrected spelling of 'indice' to 'index'.
# Copyright 2025 The Wan Team and The HuggingFace Team. All rights reserved.



if crop_type == "paste_center":
    # Paste on canvas, center position
    res = Image.new("RGB", (width, height), color=0)  # Black background

Copilot AI Nov 10, 2025


Variables src_w and src_h are only defined in the fit_within branch but are used in the paste_center crop type regardless of resize type. This will cause a NameError when resize_type='min_dimension' and crop_type='paste_center'.
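One way to address this, sketched as a self-contained helper rather than the actual `image_processor` code, is to define `src_w`/`src_h` in every resize branch before the paste step:

```python
from PIL import Image

def resize_and_paste(image: Image.Image, width: int, height: int, resize_type: str) -> Image.Image:
    """Illustrative fix: compute the resized size (src_w, src_h) regardless of resize_type."""
    if resize_type == "fit_within":
        scale = min(width / image.width, height / image.height)
    else:  # e.g. "min_dimension"
        scale = max(width / image.width, height / image.height)
    src_w, src_h = round(image.width * scale), round(image.height * scale)
    image = image.resize((src_w, src_h))
    res = Image.new("RGB", (width, height), color=0)  # black canvas
    # Centered paste; with "min_dimension" the offsets can be negative, which crops the overflow.
    res.paste(image, ((width - src_w) // 2, (height - src_h) // 2))
    return res
```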

f"Incorrect path or URL. URLs must start with `http://` or `https://`, and {audio} is not a valid path."
)
elif isinstance(audio, numpy.ndarray):
audio = audio

Copilot AI Nov 10, 2025


This line assigns audio to itself, which is redundant and serves no purpose. Consider removing it.

Suggested change (remove the line):

- audio = audio

Comment on lines +158 to +161
indice = start_frame + i * interval
if indice >= total_frames:
    break
sampled_indices.append(int(indice))

Copilot AI Nov 10, 2025


Corrected spelling of 'indice' to 'index'.

Suggested change

- indice = start_frame + i * interval
- if indice >= total_frames:
-     break
- sampled_indices.append(int(indice))
+ index = start_frame + i * interval
+ if index >= total_frames:
+     break
+ sampled_indices.append(int(index))
