In a move that has sent shockwaves through the AI filmmaking community, Tencent has released HunyuanVideo 1.5, an 8.3-billion-parameter open-source video generation model that is already being described as the strongest fully open video foundation model available.
While closed giants like OpenAI’s Sora, Google’s Veo 2, and Kling 1.5 still lock their best weights behind APIs and enterprise contracts, HunyuanVideo 1.5 ships under a permissive license that allows free commercial and research use. Every weight and every line of inference code is now public on GitHub and Hugging Face, along with a technical report describing the training recipe.
What actually sets it apart
- True 1080p cinematic output: Native generation is 768×512 at 24 fps for 5–10 seconds, but a built-in two-stage super-resolution module (trained jointly with the base DiT) pushes final renders to crisp 1920×1080 with film-grade texture and lighting fidelity.
- Runs on consumer hardware: The 8.3 B weights alone are roughly 16.6 GB at BF16, so the widely quoted ~13.6 GB VRAM figure assumes some offloading or quantization of inactive components; in practice, users are already producing 1080p clips on a single RTX 4090, or even an RTX 3090 Ti, in under 4 minutes per 5-second clip using TensorRT-style graph and attention optimizations (see the inference sketch after this list).
- Motion coherence that finally competes with the closed leaders: Independent benchmarks (VBench, T2V-Score, and human preference studies on GenAI-Arena) place HunyuanVideo 1.5 neck-and-neck with Kling 1.5 and ahead of Runway Gen-3, Luma Dream Machine, and Pika 1.5 in complex motion, camera control, and prompt adherence.
- Multi-modal conditioning out of the box: Text-to-video, image-to-video, video-to-video, depth-map control, and reference-image styling all ship in the same checkpoint.
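To make the consumer-hardware claim concrete, here is a minimal single-GPU inference sketch. It assumes a diffusers-style pipeline: the HunyuanVideoPipeline class below comes from the integration for the earlier HunyuanVideo release, and the frame count, step count, and offloading calls are illustrative defaults, so check the official repo for the exact 1.5 API before copying this.

```python
# Hypothetical single-GPU inference sketch. Pipeline class and settings are
# placeholders modelled on the diffusers integration for the earlier
# HunyuanVideo release; the official 1.5 API may differ.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanVideo-1.5",   # checkpoint ID from the Hugging Face link below
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep only the active sub-module on the GPU
pipe.vae.enable_tiling()         # decode the latent video in tiles to cap peak VRAM

video = pipe(
    prompt="A slow dolly shot through a rain-soaked neon alley at night",
    num_frames=121,              # about 5 seconds at 24 fps
    num_inference_steps=30,
).frames[0]

export_to_video(video, "alley.mp4", fps=24)
```

CPU offloading and VAE tiling are what keep peak usage in the VRAM range quoted above; without them, the BF16 weights plus activations would be a tight fit even on a 24 GB card.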
Architecture highlights
HunyuanVideo 1.5 is built on a pure Diffusion Transformer (DiT) backbone with several clever departures from earlier open models like Open-Sora or Stable Video Diffusion:
- 3D Causal VAE with 8×8×4 spatio-temporal compression (8× along each spatial axis, 4× along time, rather than the more aggressive compression used by some earlier open models) that preserves significantly more high-frequency detail.
- Rotary positional embeddings extended to the temporal dimension, giving the model a native sense of frame ordering, camera motion, and object dynamics (sketched below).
- Flow-matching training in the latent space (an objective borrowed from recent image-generation work) that yields markedly cleaner sampling trajectories than standard noise-prediction objectives (also sketched below).
- A 2-billion-parameter lightweight super-resolution DiT that was jointly trained with the base model, eliminating the usual “blurry upscaling” look that has plagued most open-source attempts.
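The temporal RoPE idea is easiest to see in code. The sketch below is illustrative rather than official: it factorises one attention head's rotary embedding across the (time, height, width) axes of the latent token grid, which is the general recipe for 3D RoPE, not necessarily the exact channel split or axis ordering Tencent uses.

```python
# Illustrative 3D rotary embedding: split a head's channels into (time, height,
# width) groups and rotate each group by its own axis position. Generic sketch,
# not HunyuanVideo's actual implementation.
import torch

def rope_1d(pos, dim, theta=10000.0):
    # pos: (N,) positions along one axis -> (N, dim // 2) cos/sin tables
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos.float()[:, None] * freqs[None, :]
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (N, dim); rotate consecutive channel pairs by the per-token angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy latent grid (a real 5-second 768x512 clip would be roughly 31x64x96 after
# the 8x8x4 VAE), with a 128-dim head split 32/48/48 across time/height/width.
T, H, W, dt, dh, dw = 8, 16, 16, 32, 48, 48
t_idx, h_idx, w_idx = torch.meshgrid(
    torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
)
q = torch.randn(T * H * W, dt + dh + dw)  # one attention head's flattened queries

cos_t, sin_t = rope_1d(t_idx.reshape(-1), dt)
cos_h, sin_h = rope_1d(h_idx.reshape(-1), dh)
cos_w, sin_w = rope_1d(w_idx.reshape(-1), dw)

q_rot = torch.cat([
    apply_rope(q[:, :dt], cos_t, sin_t),
    apply_rope(q[:, dt:dt + dh], cos_h, sin_h),
    apply_rope(q[:, dt + dh:], cos_w, sin_w),
], dim=-1)
```

Because the temporal group rotates with the frame index, relative time offsets between any two tokens are encoded directly in their attention scores, which is what gives the model its sense of ordering and motion.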
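The flow-matching objective is equally compact. The step below is a generic rectified-flow sketch on video latents, not Tencent's training code; vae_encode and dit are hypothetical stand-ins, and the latent shape in the comments simply follows the 8×8×4 compression described above.

```python
# Illustrative flow-matching training step on video latents. Generic
# rectified-flow objective; `dit` is any model mapping (noisy latents,
# timestep, text embedding) -> predicted velocity.
import torch
import torch.nn.functional as F

def flow_matching_step(dit, vae_encode, video, text_emb):
    # video: (B, 3, 121, 512, 768) pixels; the 3D VAE compresses 4x in time and
    # 8x per spatial axis, giving latents of roughly (B, C, 31, 64, 96)
    x1 = vae_encode(video)
    x0 = torch.randn_like(x1)                      # pure-noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)

    xt = (1 - t_) * x0 + t_ * x1   # straight-line interpolation between noise and data
    v_target = x1 - x0             # constant velocity of that straight path

    v_pred = dit(xt, t, text_emb)  # model predicts the velocity field
    return F.mse_loss(v_pred, v_target)
```

The straight-line interpolation and constant-velocity regression target are what distinguish this from the usual noise-prediction objective, and they are the source of the "cleaner trajectories" claim.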
Real-world impact already happening
Within 72 hours of release:
- Indie filmmakers on X and Reddit reported generating entire mood-reels and pre-vis sequences that would previously have cost $50–$200 per minute on paid APIs.
- ComfyUI node packs and Automatic1111 extensions added native HunyuanVideo support; the most popular already has 40 k+ downloads.
- Chinese studios are using it in production for virtual production backgrounds and VFX plate generation, citing cost savings of 70–90 % compared to Kling Pro or Runway credits.
The new democratisation baseline
A single hobbyist with a $1,500 GPU can now generate video that rivals what Hollywood studios were paying six-figure sums for just twelve months ago. The gap between “closed corporate AI” and “what anyone can run at home” has never been smaller.
HunyuanVideo 1.5 isn’t just another research checkpoint; it’s the moment when cinematic video synthesis officially escaped the walled gardens.
Model, code, and demos: https://hunyuan.tencent.com/video/en
GitHub: https://github.com/Tencent-Hunyuan/HunyuanVideo
Hugging Face: https://huggingface.co/Tencent-Hunyuan/HunyuanVideo-1.5
The age of truly open cinematic AI has arrived, and it runs on a gaming card.

