Add Wan2.2-S2V: Audio-Driven Cinematic Video Generation #12258
Open: tolgacangoz wants to merge 149 commits into huggingface:main from tolgacangoz:integrations/wan2.2-s2v
Commits
a40d776 temp (tolgacangoz)
4be705f template2 (tolgacangoz)
cd18245 up (tolgacangoz)
bbe282f fix-copies (tolgacangoz)
41fba83 upp (tolgacangoz)
1a0059f Refactor WanSpeechToVideoPipeline: remove unused image encoder and up… (tolgacangoz)
44f4866 encoding image to audio (tolgacangoz)
933b618 Refactor Wan Speech-to-Video audio encoding (tolgacangoz)
6d55c93 up (tolgacangoz)
e6f6a22 up (tolgacangoz)
313fea5 up (tolgacangoz)
4ac9339 Improve Wan S2V pipeline (tolgacangoz)
66ec4ff up (tolgacangoz)
d6ec465 Refactor latent preparation for S2V (tolgacangoz)
a463c09 up (tolgacangoz)
65191a9 feat: Add audio, pose, and advanced motion conditioning (tolgacangoz)
7925229 Refactor `WanS2V` transformer and introduce FramePack motioner (tolgacangoz)
323049d Removes unused code from the speech-to-video pipeline (tolgacangoz)
6515b23 Refactor WanS2VTransformer and improve conditioning (tolgacangoz)
fe5a626 Add `AttentionMixin` to `WanS2VTransformer3DModel` (tolgacangoz)
bb5f10a fix: Update parameter name for audio encoder to `num_attention_heads` (tolgacangoz)
bb5f4c9 feat: Improve support for S2V model conversion (tolgacangoz)
f6fb523 simplify (tolgacangoz)
dfec152 up (tolgacangoz)
21cd65f refactor: Simplify AdaLayerNorm initialization and forward method (tolgacangoz)
89b9bcb fix: Correct parameter value for pose_dim and name for num_attention_… (tolgacangoz)
167bd23 fix: Update audio injector to use WanTransformerBlock instead of WanA… (tolgacangoz)
9b6bf4b upp (tolgacangoz)
4bed628 feat: Add audio injector attention mappings to transformer key renaming (tolgacangoz)
a112328 up docs (tolgacangoz)
0685646 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
c798d93 Adapt the `WanS2VTransformerBlock` to handle the new `temb` format, w… (tolgacangoz)
d612c41 style (tolgacangoz)
f1ef8fa Simplify (tolgacangoz)
30be7e8 Remove unused audio encoder import (tolgacangoz)
7ee98eb Fix typo (tolgacangoz)
4674ead up (tolgacangoz)
6ee3b85 simplify (tolgacangoz)
508cf8d Adding rope for hidden states and image (tolgacangoz)
1fcfeba style (tolgacangoz)
74d6381 Refactor ropes (tolgacangoz)
9cd08bc refactor (tolgacangoz)
fd3af1d up (tolgacangoz)
97991aa Preserve the lost dimension explicitly (tolgacangoz)
2048861 Use complex rope temporarily (tolgacangoz)
17166e2 upp (tolgacangoz)
b9a7149 style (tolgacangoz)
244005a fix (tolgacangoz)
fde574d fix: correct key names in S2V transformer mapping for audio components (tolgacangoz)
83567bf fixes (tolgacangoz)
551c74e Fix errors encountering during inference (tolgacangoz)
8341218 up (tolgacangoz)
6663e58
a9b08de Fix bugs and improve stability in WanSpeechToVideo model (tolgacangoz)
86123d9 style (tolgacangoz)
8064c42 Enhance load_audio function to support audio loading from URLs using … (tolgacangoz)
ac16d5d upp (tolgacangoz)
dbc0764 add _repeated_blocks (tolgacangoz)
80a2fbe up (tolgacangoz)
acc8ecb fix (tolgacangoz)
a729033 up (tolgacangoz)
a0d5217 fix previous latensts (tolgacangoz)
4fd1014 set deterministic for fa2 (tolgacangoz)
3773de3 up (tolgacangoz)
d0e3e26 update example docstring (tolgacangoz)
fe41edd upp (tolgacangoz)
f5439e1 up (tolgacangoz)
eef629d Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
d72f549 style (tolgacangoz)
5eea0c7 Enhance load_video function with frame sampling options and reverse p… (tolgacangoz)
33e5b67 Refactor load_pose_condition method to simplify pose video handling a… (tolgacangoz)
9562c26 style (tolgacangoz)
a2c1952 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
8c4c018 Update parameter descriptions and simplify tensor operations in WanSp… (tolgacangoz)
12facf8 Propose to vectorize to assume each element in a batch standard, same (tolgacangoz)
4ab5547 style (tolgacangoz)
4e5f357 Fix pose_video tensor initialization to use correct dtype and device (tolgacangoz)
b9224b9 up (tolgacangoz)
bcf71db ıp (tolgacangoz)
c248b6d fix (tolgacangoz)
206bbaa up (tolgacangoz)
a126570 Fix mask_input tensor shape and dimension in WanS2VTransformer3DModel (tolgacangoz)
bd0b72e Fix mask_input tensor indexing in WanS2VTransformer3DModel (tolgacangoz)
29bddb5 up docs (tolgacangoz)
37a44c2 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
746514f Adds video and audio merging functionality in docs (tolgacangoz)
b2e57b8 fix: initialize pose_video variable in WanSpeechToVideoPipeline (tolgacangoz)
1321330 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
f7fbf36 Enables passing attention kwargs (tolgacangoz)
111085f Propose flash attention with precomputed max_seqlen_k-only (tolgacangoz)
11d98e1 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
8b58f63 style (tolgacangoz)
66e58b8 Propose to add `FP32RMSNorm` (tolgacangoz)
f503a26 Fix argument unpacking in audio injector call in WanS2VTransformer3DM… (tolgacangoz)
9fe3596 Remove `FP32RMSNorm` (tolgacangoz)
3542a46 Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
6761385 Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
e0b8ce9 Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
b8b6709 Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
d2840fc Update module names (tolgacangoz)
a6c1b27 Adds `export_to_merged_video_audio` utility (tolgacangoz)
2f09d10 style (tolgacangoz)
454b442 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
d837dfc Refactor audio injection logic (tolgacangoz)
d9fd755 style (tolgacangoz)
e5ab1dd Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
c6e8fa4 Update src/diffusers/models/transformers/transformer_wan_s2v.py (tolgacangoz)
62dc61e Refactors audio injection logic (tolgacangoz)
6b98ebd Refactor adain mode handling (tolgacangoz)
8665fd5 style (tolgacangoz)
dd15817 revert (tolgacangoz)
9f4edb4 Take `AdaLayerNorm` from `normalization` (tolgacangoz)
52ffc49 style (tolgacangoz)
6196332 Refactor audio encoder with weighted average layer (tolgacangoz)
5c50519 style (tolgacangoz)
d57448a Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
ee1f6ff Enhance image resizing functionality with additional options for resi… (tolgacangoz)
e15d3f6 Add resize_mode parameter to preprocess_video for flexible video resi… (tolgacangoz)
bc2165a Refactor video processing in WanSpeechToVideoPipeline to support bili… (tolgacangoz)
0bf98b6 style (tolgacangoz)
226a451 Add `Motioner` class for _simple_ motion processing in `WanS2VTransfo… (tolgacangoz)
aef52d3 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
7122b61 Add `WanS2VCausalConvLayer` for modularism (tolgacangoz)
9dab88f style (tolgacangoz)
70ef9c3 Add CP configs (tolgacangoz)
2d6176b Update attention dispather usage (tolgacangoz)
77da3e3 Refactor example docstring for aspect ratio resizing and update num_f… (tolgacangoz)
9f61f5c up docs (tolgacangoz)
079dd7d up docs (tolgacangoz)
1c553a1 up test (tolgacangoz)
dfb99d0 up tests (tolgacangoz)
b5421f3 down (tolgacangoz)
0cbe32e Add deterministic audio generation and callback configuration test (tolgacangoz)
685d86e up (tolgacangoz)
8a5bb49 style (tolgacangoz)
97c2125 up (tolgacangoz)
83c0a36 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
2c4d83a Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
781e9be Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
2575d47 Use immutable default values (tolgacangoz)
2949ca9 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
89f923d Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
3a13398 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
cb615bb Refactor device handling in WanSpeechToVideoPipeline for consistency (tolgacangoz)
1fb2c6b Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
496fa80 Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
54cfc71 Update Wan-S2V model description and contributor info (tolgacangoz)
5062ea7 Update Wan-S2V documentation with project page link (tolgacangoz)
61eaf8c Merge branch 'main' into integrations/wan2.2-s2v (tolgacangoz)
Similar to https://huggingface.co/docs/transformers/v4.57.1/en/model_doc/speech_to_text_2#overview:~:text=This%20model%20was%20contributed%20by%20Patrick%20von%20Platen.