15.01.2026 18:49
Author: Viacheslav Vasipenok

Meituan Unveils LongCat-Image: A Compact 6B Bilingual Powerhouse Redefining Open-Source Image Generation


In late 2025, Meituan — China's leading super-app for food delivery, travel, and local services — made waves in the AI community by open-sourcing LongCat-Image, a groundbreaking bilingual (Chinese-English) foundation model for text-to-image generation and editing.

With just 6 billion parameters, this lightweight diffusion-based model punches far above its weight class, delivering photorealistic outputs, exceptional text rendering, and production-ready performance that rivals or exceeds much larger open-source and even some proprietary systems.

Developed by Meituan's dedicated LongCat AI team, LongCat-Image arrives amid intense competition in multimodal AI from Chinese tech giants like Alibaba, Tencent, and ByteDance. Yet it stands out by prioritizing efficiency, real-world usability, and multilingual excellence over sheer scale.


Breaking the Scaling Myth: Quality Through Smart Design

Conventional wisdom in image generation suggests bigger is better: models like FLUX.1-dev (12B), Tencent's HunyuanImage 3.0 (an 80B mixture-of-experts model), or proprietary giants often rely on massive parameter counts for top-tier results. LongCat-Image flips this narrative.

Built on a hybrid Multimodal Diffusion Transformer (MM-DiT) architecture with only 6B parameters, the model is remarkably efficient. It requires roughly 17–18 GB of VRAM for inference (with CPU offloading options to reduce this further), enabling faster generation times and lower deployment costs than 20B+ counterparts.
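
To ground those numbers, here is a minimal sketch of how such a model is typically loaded through Diffusers with CPU offloading enabled. It assumes the Hugging Face repo meituan-longcat/LongCat-Image (cited later in this article) exposes a standard Diffusers pipeline; the exact pipeline class and call arguments may differ from the official documentation.

```python
# Minimal sketch: loading LongCat-Image via Diffusers with CPU offloading.
# The repo id comes from this article; the pipeline class and arguments
# are assumptions, not confirmed API details.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",
    torch_dtype=torch.bfloat16,   # half-precision weights keep VRAM near the quoted 17-18 GB
    trust_remote_code=True,       # custom MM-DiT pipelines often ship their own code
)
pipe.enable_model_cpu_offload()   # stream submodules to the GPU only when needed

image = pipe(
    prompt="A rainy street market at dusk, neon signs reflecting on wet pavement",
    height=1024,
    width=1024,
).images[0]
image.save("longcat_demo.png")
```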

Benchmarks confirm its prowess:

  • On T2I-CoreBench (a comprehensive text-to-image evaluation), LongCat-Image ranks 2nd among all open-source models as of December 2025, trailing only the much larger 32B FLUX.2-dev.
  • It delivers open-source state-of-the-art (SOTA) results on image editing benchmarks such as GEdit-Bench (Chinese: 7.60 / English: 7.64), ImgEdit-Bench (~4.50), and CEdit-Bench.
  • In Chinese text rendering, it scores an impressive 90.7 on the ChineseWord benchmark, covering all 8,105 standard Chinese characters with high accuracy, stability, and support for rare/complex glyphs.

These gains stem from meticulous data curation rather than brute-force scaling. The team applied rigorous filtering to exclude low-quality and AI-generated images and refined the model in stages (pre-training → mid-training → SFT → RLHF with reward models). They also introduced a specialized character-level encoding strategy: when text appears in quotation marks ('...' / "..." / ‘…’ / “…”), the model switches to per-character tokenization, dramatically improving legibility and fidelity in both languages.
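
As a concrete illustration of that convention, the sketch below (reusing the pipe object from the earlier snippet) simply wraps the desired glyphs in quotation marks inside an otherwise ordinary prompt. The prompt is illustrative, not taken from Meituan's documentation.

```python
# Sketch: triggering the character-level text-rendering mode described above.
# Per the article, quoted spans are tokenized per character, which is what
# makes the rendered glyphs legible; reuses `pipe` from the previous example.
poster_prompt = (
    'A minimalist café poster, warm lighting, with the headline '
    '"早安咖啡" at the top and the tagline "Fresh Brew Daily" below'
)
image = pipe(prompt=poster_prompt, height=1024, width=1024).images[0]
image.save("longcat_poster.png")
```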


Standout Strengths: Where LongCat-Image Shines

1. Superior Bilingual Text Rendering

LongCat-Image excels at embedding accurate, stable Chinese and English text into images — a notorious pain point for most diffusion models. Posters, signs, menus, product labels, and UI mockups come out crisp and natural, even with mixed-language prompts or intricate typography. This makes it especially valuable for e-commerce, advertising, and cross-border design workflows.

2. Photorealism and Aesthetic Quality

Through innovative data strategies and progressive training, the model produces studio-grade visuals: believable lighting, accurate textures, realistic proportions, and physics-aware object placement. Subjective mean opinion scores (MOS) place it among the best open-source options for realism, composition, and overall appeal.

3. Unified Generation + Editing Pipeline

The family includes:

  • LongCat-Image — core text-to-image model.
  • LongCat-Image-Edit — specialized for instruction-based editing (object swap, background change, style adjustment, text modification) with strong consistency preservation across multi-turn edits.
  • LongCat-Image-Dev — mid-training checkpoints for fine-tuning and research.

All variants support Diffusers integration, LoRA adapters, ComfyUI workflows, and full training code release under Apache 2.0, lowering barriers for developers to customize styles, fine-tune on domain data, or build production pipelines.
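
A hedged sketch of what that developer workflow might look like: loading the edit variant, attaching a LoRA, and issuing a text-modification instruction. The repo id meituan-longcat/LongCat-Image-Edit, the placeholder your-org/your-style-lora adapter, and the image-plus-instruction call signature are all assumptions for illustration, not confirmed API details.

```python
# Sketch: instruction-based editing with the edit variant plus a custom LoRA.
# Repo ids and the call signature below are assumptions based on this article.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

edit_pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image-Edit",   # assumed repo id for the edit variant
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
edit_pipe.enable_model_cpu_offload()

# Optional: apply a style LoRA (placeholder repo id; requires the custom
# pipeline to support Diffusers' LoRA-loading mixin).
edit_pipe.load_lora_weights("your-org/your-style-lora")

source = load_image("product_shot.png")
edited = edit_pipe(
    prompt='Replace the label text with "新品上市" and keep everything else unchanged',
    image=source,
).images[0]
edited.save("product_shot_edited.png")
```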

4. Production Focus Over Benchmark Chasing

Meituan emphasizes real scenarios: responsive 1024×1024 output (feasible even on 16 GB GPUs with CPU offloading), batch generation for marketing assets, and reliable instruction following without layout drift.
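
For the batch-generation use case, a common Diffusers pattern is to request several samples per prompt, as sketched below. num_images_per_prompt is a standard argument on stock Diffusers pipelines, though the custom LongCat pipeline may expose batching differently.

```python
# Sketch: generating several marketing-asset candidates in one call,
# reusing `pipe` from the first example. The batching argument is a
# standard Diffusers convention, assumed (not confirmed) for this pipeline.
variants = pipe(
    prompt='Summer sale banner with the slogan "限时五折" in bold red brushstrokes',
    num_images_per_prompt=4,   # four candidate assets in one pass
).images
for i, img in enumerate(variants):
    img.save(f"banner_{i}.png")
```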

Early user reports on Reddit (r/StableDiffusion, r/LocalLLaMA) praise its editing consistency — unlike some competitors that warp characters or shift compositions during refinements.


Broader Impact and Ecosystem Play

LongCat-Image is already powering features in Meituan's own apps and web platforms, demonstrating immediate commercial viability. Its release — complete with technical report (arXiv:2512.07584), Hugging Face repo (meituan-longcat/LongCat-Image), GitHub training code, and community WeChat group — signals Meituan's ambition to foster an open, collaborative Chinese AI ecosystem.

In a field dominated by closed models and resource-heavy training, LongCat-Image proves that intelligent architecture, clean data, and targeted optimization can deliver frontier-level performance at a fraction of the cost. For developers, designers, and businesses needing reliable bilingual image tools — especially those handling Chinese content — this 6B model may become the new go-to open-source baseline.

Explore it today on ModelScope (https://modelscope.cn/models/meituan-longcat/) or Hugging Face. With weights, pipelines, and documentation fully available, the era of accessible, high-fidelity bilingual image AI just got a major upgrade.

