10.09.2025 12:43

Wan2.2-S2V: A 14-Billion-Parameter Model for Cinematic Video Generation from Audio

The latest Wan2.2-S2V model, boasting 14 billion parameters, transforms static images and audio into dynamic, cinematic-quality videos featuring realistic facial expressions, natural body movements, and professional camera work.  

Key Features:

  • High Dynamic Consistency: Ensures smooth, stable animations throughout the video.  
  • Superior Audio-Video Sync: Closely aligns facial movements and articulation with the audio.  
  • Motion and Environment Control via Text Prompts: Allows customization of gestures, emotions, backgrounds, and character actions (e.g., "man walking on tracks," "girl singing in the rain," "old man playing piano by the sea").  
  • Complex Scenario Support: Handles advanced effects like camera motion, rain, wind, parachutes, and filming from a moving train.  

Taking a single image and an audio file as input, Wan2.2-S2V outputs synchronized videos tailored to text prompts.  


Performance Highlights:

Reported tests show the model matching or exceeding competing models. Metrics include (↓ means lower is better, ↑ means higher is better):  

  • FID ↓ 15.66 (high video quality),  
  • EFID ↓ 0.283 (natural facial expressions),  
  • CSIM ↑ 0.677 (character identity preservation).  

SSIM, PSNR, and Sync-C scores further confirm its visual clarity, stability, and audio synchronization.  
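
To give a feel for what an identity-preservation score like CSIM measures, here is a minimal sketch. It assumes CSIM is the cosine similarity between a face-identity embedding of the reference image and embeddings of the generated frames, averaged over frames, which is a common convention in talking-head evaluation; the article itself does not define the metric. The function names and the toy vectors are hypothetical. In practice the embeddings would come from a face-recognition network.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def csim(reference_embedding, frame_embeddings):
    """Average cosine similarity between the reference and each frame.

    Assumption: this mirrors how CSIM is commonly computed; it is not
    taken from the Wan2.2-S2V evaluation code.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    scores = [cosine_similarity(reference_embedding, f) for f in frame_embeddings]
    return sum(scores) / len(scores)


# Toy, made-up embeddings: one reference face and three generated frames.
reference = [0.9, 0.1, 0.4]
frames = [[0.88, 0.12, 0.41], [0.85, 0.2, 0.35], [0.7, 0.3, 0.5]]
score = csim(reference, frames)
```

A score close to 1 means the generated frames stay close to the reference identity in embedding space, which is why a higher value is better.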

Fully open-source, the model provides access to its code and weights, and appears compatible with LoRA adapters from Wan 2.x. Try it online at https://wan.video.

