Keye 2.0 from Kuaishou: Bringing DeepSeek Sparse Attention to Long Video Understanding

Kuaishou has released Keye 2.0 (Keye-VL-2.0-30B-A3B), a multimodal model that marks a meaningful step forward for long-context video understanding. Its standout contribution is integrating DeepSeek Sparse Attention into video processing, something that has been difficult to do effectively until now.

256K Context Without Attention Collapse

The model supports a 256K context window, allowing it to reliably process hour-long videos while maintaining coherent understanding across the entire timeline.

Traditional attention mechanisms often suffer from “attention collapse” on long inputs: the model performs well on the beginning and end of a video but loses causal connections and temporal coherence in the middle.

Keye 2.0 addresses this by leveraging sparse attention patterns originally developed by DeepSeek. This enables the model to focus computational resources more efficiently on relevant parts of the video sequence rather than attending uniformly to every token.
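To make the idea of attending only to relevant tokens concrete, here is a minimal sketch of top-k sparse attention for a single query, in plain Python. It is an illustration only, not Keye 2.0's or DeepSeek's implementation: the function name topk_sparse_attention, the toy vectors, and the choice to rank keys by their exact attention score are all assumptions made for this example. Because the sketch scores every key exactly before selecting, it saves no compute by itself; a practical design would need a cheaper selection step.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Hypothetical helper for illustration: returns the output vector and
    the sorted indices of the keys that were kept.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    # Keep the k best-scoring positions; all other tokens are skipped.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()
    # Softmax is taken over the kept positions only.
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, keep))
        for d in range(dim)
    ]
    return output, keep


# Toy data: four "tokens" with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
out, kept = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
print(kept)  # [0, 2]
```

Setting k to the number of keys recovers ordinary dense attention, which makes the sparse variant easy to compare against a dense baseline.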

Model Architecture and Efficiency

Keye 2.0 is a 30B-parameter Mixture-of-Experts (MoE) model with only 3B active parameters during inference. This architecture delivers strong performance while keeping compute requirements relatively manageable.
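The gap between total and active parameters comes from expert routing: each token is sent to only a few experts, so most expert weights sit idle for any given token. The sketch below shows generic top-k gating in plain Python. The expert functions, gate logits, and top_k value are hypothetical toy choices; the article does not state how many experts Keye 2.0 has or how many it activates per token.

```python
import math


def route_token(gate_logits, top_k):
    """Pick the top_k experts for one token and normalize their gate weights."""
    chosen = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:top_k]
    m = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, gate_logits, top_k):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = route_token(gate_logits, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy scalar "experts"; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores
result = moe_layer(3.0, experts, gate_logits, top_k=2)
print(result)  # 7.5
```

For the figures quoted above, 3B active out of 30B total means roughly 10% of the weights are used per token. That ratio need not match the fraction of experts activated, since components such as attention layers and embeddings are typically shared across all tokens.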

Key efficiency gains include:

Approximately 50% lower prefill cost compared to dense models of similar capability.
Better scaling with longer inputs: the score on VideoMME V2 rises from 35.34% at 64 frames to 42.44% at 512 frames, a gain of 7.10 points.

Strong Results on Long Video Benchmarks

The model achieves 74.10 on LongVideoBench, demonstrating solid long-form video understanding.

It shows particular strength in tasks requiring:

Timestamp awareness;
Causal reasoning across time;
Understanding tutorials and step-by-step processes;
Gaming footage analysis;
Long-form vlogs and narrative videos.

These capabilities make it especially relevant for real-world applications where videos are long and context matters.

Why This Matters

Most multimodal models still struggle with long video inputs because quadratic attention becomes prohibitively expensive and attention mechanisms degrade over very long sequences. By successfully adapting DeepSeek’s sparse attention technique to video, Kuaishou has shown a practical path toward more efficient and coherent long-context multimodal reasoning.
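A quick back-of-the-envelope count shows why quadratic attention becomes the bottleneck. The sketch below compares the number of query-key pairs in causal dense attention with a causal variant in which each query attends to at most k keys. The budget k = 2048 is a hypothetical value chosen for illustration, not a published Keye 2.0 setting, and the count ignores the cost of choosing which keys to keep.

```python
def causal_dense_pairs(n):
    """Query-key pairs when token i attends to all tokens 0..i."""
    return n * (n + 1) // 2


def causal_sparse_pairs(n, k):
    """Query-key pairs when token i attends to at most k earlier tokens."""
    if k >= n:
        return causal_dense_pairs(n)
    # The first k tokens see everything before them; the rest are capped at k.
    return k * (k + 1) // 2 + (n - k) * k


context = 256 * 1024   # 256K tokens
budget = 2048          # hypothetical per-query key budget

dense = causal_dense_pairs(context)
sparse = causal_sparse_pairs(context, budget)
print(f"dense pairs:  {dense:,}")
print(f"sparse pairs: {sparse:,}")
print(f"reduction:    {dense / sparse:.1f}x")
```

Dense cost grows with the square of the sequence length, while the capped variant grows linearly once the sequence exceeds k. This pair count covers only the core attention step, so it is not directly comparable to the end-to-end prefill saving of about 50% reported above.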

The ability to maintain temporal logic and causal understanding across hour-long videos (rather than just short clips) opens doors for better video search, summarization, editing assistance, educational tools, and content analysis at scale.

Open Weights

Kuaishou has made the model weights publicly available on ModelScope:

→ Keye-VL-2.0-30B-A3B

This release adds to the growing ecosystem of strong open multimodal models and provides researchers and developers with a concrete example of sparse attention working effectively on long video tasks.

Keye 2.0 doesn’t just push context length; it demonstrates that clever attention mechanisms can make long video understanding both feasible and efficient. For anyone working on video AI, this is worth paying attention to.