Keye 2.0 from Kuaishou: Bringing DeepSeek Sparse Attention to Long Video Understanding

Kuaishou has released Keye 2.0 (Keye-VL-2.0-30B-A3B), a multimodal model aimed at long-context video understanding. Its standout feature is the integration of DeepSeek Sparse Attention into video processing, something that has been difficult to do effectively until now.
256K Context Without Attention Collapse

Traditional attention mechanisms often suffer from “attention collapse” on long inputs: the model performs well on the beginning and end of a video but loses causal connections and temporal coherence in the middle.
Keye 2.0 addresses this by leveraging sparse attention patterns originally developed by DeepSeek. This enables the model to focus computational resources more efficiently on relevant parts of the video sequence rather than attending uniformly to every token.
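To make the idea of attending to a subset of tokens concrete, here is a minimal sketch of a generic top-k sparse attention step for a single query. It is not Keye 2.0's or DeepSeek's actual implementation: the function names (`softmax`, `attend`), the `top_k` parameter, and the toy vectors are all assumptions chosen for illustration. The sketch also scores every key before selecting, so it shows the selection pattern only, not the compute savings a production system would get from a cheaper selection step.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, top_k=None):
    """Scaled dot-product attention for one query vector.

    With top_k=None every key is used (dense attention). Otherwise only
    the top_k highest-scoring keys receive any weight, which is one
    simple form of sparse attention (hypothetical, for illustration).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    if top_k is None or top_k >= len(keys):
        kept = list(range(len(keys)))
    else:
        # Indices of the top_k keys ranked by score.
        ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
        kept = ranked[:top_k]

    # Normalize only over the kept keys; all others get zero weight.
    weights = softmax([scores[i] for i in kept])

    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, kept):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, sorted(kept)


# Toy example: four "frame tokens" with 2-d keys and 1-d values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, kept = attend([1.0, 0.0], keys, values, top_k=2)
# kept is [0, 2]: only the two keys most aligned with the query contribute.
```

With `top_k=None` the same function falls back to dense attention over all four keys, which makes it easy to compare the two behaviors on the same input.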
Model Architecture and Efficiency

Key efficiency gains include:
- Approximately 50% lower prefill cost compared to dense models of similar capability.
- Better scaling with longer inputs: performance on VideoMME V2 improves as the number of frames increases, from 35.34% at 64 frames to 42.44% at 512 frames.
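For a sense of scale, the VideoMME V2 figures quoted above can be turned into an absolute and a relative gain with a few lines of arithmetic. The variable names are illustrative; the two scores and frame counts are the only inputs taken from the article.

```python
# VideoMME V2 scores (%) as quoted in the article.
score_64_frames = 35.34
score_512_frames = 42.44

# Absolute gain in percentage points and relative gain over the 64-frame score.
absolute_gain = score_512_frames - score_64_frames
relative_gain = absolute_gain / score_64_frames * 100

# How many times more frames the model sees at the higher setting.
frame_ratio = 512 / 64

print(f"Absolute gain: {absolute_gain:.2f} points")    # 7.10 points
print(f"Relative gain: {relative_gain:.1f}%")          # 20.1%
print(f"Frame count multiplier: {frame_ratio:.0f}x")   # 8x
```

In other words, an 8x increase in frames yields roughly a 7-point (about 20% relative) improvement on this benchmark.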
Strong Results on Long Video Benchmarks
The model achieves 74.10 on LongVideoBench, demonstrating solid long-form video understanding.

The capabilities highlighted for the model include:
- Timestamp awareness;
- Causal reasoning across time;
- Understanding tutorials and step-by-step processes;
- Gaming footage analysis;
- Long-form vlogs and narrative videos.
These capabilities make it especially relevant for real-world applications where videos are long and context matters.
Why This Matters

The ability to maintain temporal logic and causal understanding across hour-long videos (rather than just short clips) opens doors for better video search, summarization, editing assistance, educational tools, and content analysis at scale.

Open Weights
Kuaishou has made the model weights publicly available on ModelScope.
This release adds to the growing ecosystem of strong open multimodal models and provides researchers and developers with a concrete example of sparse attention working effectively on long video tasks.
Keye 2.0 doesn’t just push context length; it demonstrates that clever attention mechanisms can make long video understanding both feasible and efficient. For anyone working on video AI, this is worth paying attention to.