Qwen-VLA: Alibaba’s Unified Vision-Language-Action Model Brings Versatile Robot Control to a New Level

The Qwen team at Alibaba has released Qwen-VLA, a powerful new Vision-Language-Action (VLA) model that can control a wide variety of robots — from single-arm manipulators to dual-arm systems and potentially humanoids — using a single unified model without task-specific or platform-specific retraining.

Announced on May 29, 2026, Qwen-VLA represents a significant step toward generalist embodied AI, addressing one of the biggest challenges in robotics: fragmentation across different hardware platforms and task types.

How Qwen-VLA Works

Like other VLA models, Qwen-VLA takes two inputs:

A real-time image (or video frames) from the robot’s camera(s);
A natural language command (e.g., “Pick up the red cup and place it on the shelf” or “Navigate to the kitchen table while avoiding obstacles”).

It outputs continuous actions or trajectories tailored to the specific robot’s kinematics.

Technical Architecture:

Built on the Qwen3.5-4B vision-language backbone;
Augmented with a 1.15 billion parameter DiT (Diffusion Transformer) flow-matching action decoder;
Uses embodiment-aware prompt conditioning: the robot type and control interface are described in text, and the model conditions its actions on that description.
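To make the flow-matching decoder idea more concrete, here is a minimal sketch of how such a decoder typically produces an action: it starts from random noise and integrates a velocity field from t=0 to t=1. This is not Qwen-VLA's actual code. The function names, the 7-dimensional action, and the step count are assumptions for illustration, and the toy velocity function stands in for the learned DiT, which would be conditioned on the image and the text prompt rather than given the target.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity network.

    For a straight-line path from noise to `target`, the velocity at
    time t is (target - x) / (1 - t). A real decoder predicts this
    from the image and text instead of being handed the target.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed 7-dim action, e.g. six pose deltas plus one gripper command.
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]
action = sample_action(target_action)
```

With the exact straight-line velocity, the Euler integration lands on the target at the final step. A trained network only approximates that field, which is why the number of integration steps becomes a speed versus accuracy trade-off in practice.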

This design allows the same model to handle three major categories of tasks:

Manipulation (grasping, moving objects, bimanual coordination);
Navigation (vision-language navigation in indoor environments);
Trajectory prediction (predicting and generating smooth motion paths).

Switching between different robot embodiments requires only changing the textual prompt — no new training heads or fine-tuning needed.
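As a rough illustration of what switching embodiments through the prompt could look like, the sketch below builds two prompts for the same instruction. The prompt template, the `Embodiment` class, and the action dimensions are hypothetical; the article does not specify the format Qwen-VLA actually expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot, expressed as plain text fields."""
    name: str
    control_interface: str
    action_dim: int


def build_prompt(embodiment, instruction):
    """Combine the robot description and the task into one text prompt."""
    return (
        f"Robot: {embodiment.name}. "
        f"Control: {embodiment.control_interface} "
        f"({embodiment.action_dim}-dim actions). "
        f"Task: {instruction}"
    )


# Illustrative values only, not taken from the Qwen-VLA release.
single_arm = Embodiment(
    "single-arm manipulator", "end-effector delta pose plus gripper", 7
)
dual_arm = Embodiment("dual-arm ALOHA", "joint positions for both arms", 14)

instruction = "Pick up the red cup and place it on the shelf"
prompt_a = build_prompt(single_arm, instruction)
prompt_b = build_prompt(dual_arm, instruction)
```

The point of the sketch is that only the text changes between the two robots. The model weights and the calling code stay the same.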

Impressive Benchmark Performance

Qwen-VLA demonstrates strong results across both simulation and real-world settings, often matching or surpassing specialist models trained for single tasks or platforms:

LIBERO (manipulation benchmark): 97.9% success (near state-of-the-art);
RoboTwin-Hard: 87.2%;
Simpler-WidowX: 73.7% (outperforms many specialists);
Real-world ALOHA dual-arm robot:

- 83.6% average success in familiar (in-domain) conditions;

- 76.9% average success in unfamiliar (out-of-distribution) conditions. Both figures are well ahead of π₀.5 from Physical Intelligence, which is reported at 71.6% and 41.5%, respectively.
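A quick way to read the ALOHA numbers is to look at how much success each model keeps when moving from familiar to unfamiliar conditions. The short calculation below does that, assuming the π₀.5 figures of 71.6% and 41.5% refer to in-domain and out-of-distribution conditions in that order. The dictionary keys and function name are made up for this example.

```python
# Success rates (%) on the real-world ALOHA tasks, as quoted above.
results = {
    "Qwen-VLA": {"in_domain": 83.6, "out_of_distribution": 76.9},
    "pi0.5": {"in_domain": 71.6, "out_of_distribution": 41.5},
}


def generalization_gap(scores):
    """Return (absolute drop in points, % of in-domain success retained)."""
    drop = scores["in_domain"] - scores["out_of_distribution"]
    retained = scores["out_of_distribution"] / scores["in_domain"]
    return round(drop, 1), round(100 * retained, 1)


summary = {name: generalization_gap(s) for name, s in results.items()}
for name, (drop, retained) in summary.items():
    print(f"{name}: drops {drop} points, retains {retained}% of in-domain success")
```

Under that reading, Qwen-VLA loses under 7 points out of distribution while π₀.5 loses about 30, so the out-of-distribution gap is the larger part of the difference between the two models.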

It also shows competitive performance in navigation benchmarks (R2R, RxR) and zero-shot capabilities on dynamic tasks like DOMINO.

Why This Matters

Most current robotics AI systems are highly specialized: one model for picking, another for walking, yet another for each robot arm type. Qwen-VLA pushes toward a true generalist policy — a single brain that can be deployed across many platforms with minimal adaptation.

This approach could substantially reduce development time and cost for robot deployment, making advanced AI control more accessible for research labs, startups, and industry applications in manufacturing, logistics, and home assistance.

The model is open-sourced with code, weights, and a technical report available on GitHub and arXiv, following Qwen’s tradition of releasing powerful models to the community.

The Road Ahead for Embodied AI

Qwen-VLA joins other notable efforts like NVIDIA’s GR00T and Physical Intelligence’s π₀.5 in the race toward general-purpose robot intelligence. Its strength lies in unification and strong generalization across embodiments and environments.

As hardware improves and more diverse training data becomes available, models like Qwen-VLA could accelerate the arrival of truly versatile, language-guided robots that understand and act in the physical world as naturally as today’s LLMs understand text.

This release strengthens Alibaba’s position in the rapidly evolving embodied AI space and provides the open-source community with a powerful new tool for robot development. Exciting times ahead for physical intelligence!
