31.01.2026 14:28
Author: Viacheslav Vasipenok

Inworld Unleashes TTS-1.5: Real-Time Voice So Good It Feels Illegal at $0.005/min

Inworld AI has launched TTS-1.5, a major upgrade to its text-to-speech engine, positioning it as the leading real-time voice AI solution currently available.

Announced on January 21, 2026, the release builds on Inworld's existing top-ranked models, delivering breakthrough improvements in latency, emotional expressiveness, stability, and affordability while maintaining the #1 spot on the Artificial Analysis TTS Leaderboard.

This leaderboard relies on blind evaluations by thousands of real users assessing naturalness and human-likeness, making it one of the most credible independent benchmarks in the TTS space.

Inworld TTS-1.5 outperforms competitors — including established players like ElevenLabs (Multilingual v2) and OpenAI's TTS offerings — in user preference metrics, combining superior perceptual quality with production-ready performance for real-time applications.


Ultra-Low Latency for True Real-Time Conversations

The standout feature of TTS-1.5 is its dramatically reduced latency, enabling fluid, interruptible voice interactions that feel truly conversational.

  • TTS-1.5 Mini achieves P90 time-to-first-audio latency under 130 ms, ideal for latency-sensitive use cases where every millisecond counts, such as live customer support agents or interactive gaming NPCs.
  • TTS-1.5 Max delivers under 250 ms P90 latency while prioritizing maximum expressiveness and fidelity.

These figures represent a 4x speedup over Inworld's previous generation of models. Median latencies are lower still: below 200 ms for Max and below 100 ms for Mini.
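
To make the difference between a P90 figure and a median concrete, here is a minimal Python sketch using only the standard library. The sample values are hypothetical and are not Inworld measurements, and the function name latency_summary is an illustration, not part of any Inworld SDK.

```python
import statistics


def latency_summary(samples_ms):
    """Return (median, p90) for time-to-first-audio samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    median = statistics.median(samples_ms)
    # quantiles(n=10) returns nine cut points; the last is the 90th percentile.
    p90 = statistics.quantiles(samples_ms, n=10, method="inclusive")[-1]
    return median, p90


# Hypothetical measurements with one slow outlier (180 ms).
samples = [80, 85, 90, 92, 95, 98, 100, 105, 120, 180]
median_ms, p90_ms = latency_summary(samples)
print(f"median={median_ms} ms, p90={p90_ms} ms")
```

In this made-up sample the single slow request barely moves the median but pulls the P90 well above it, which is why vendors tend to quote both numbers.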

Such performance unlocks seamless back-and-forth dialogue at scale, supporting thousands of concurrent queries without robotic pauses or awkward overlaps — critical for consumer-facing voice AI where engagement drops sharply with delays above 300–500 ms.


Enhanced Expressiveness and Stability

Beyond speed, TTS-1.5 significantly boosts audio quality:

  • 30% greater emotional expressiveness compared to prior Inworld models, allowing voices to convey nuanced tone, context-aware inflection, and personality that keeps users immersed.
  • 40% reduction in word error rate (WER), minimizing hallucinations, abrupt cutoffs, and audio artifacts for cleaner, more reliable output.

The result is speech described as "virtually indistinguishable from human speaking: emotionally nuanced, contextually aware, and reliably accurate." Inworld claims professional voice actor quality at human-native speeds, with improved voice cloning (instant from 5–15 seconds of audio, or fine-tuned professional options) that captures subtle prosody and timbre.
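
For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The sketch below is a generic implementation under that assumption. The article does not describe how Inworld measures WER, and the function name and sample sentences are purely illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one word is dropped from a four-word reference.
wer = word_error_rate("the quick brown fox", "the quick fox")
print(f"WER = {wer:.2f}")
```

Under this definition, a 40% reduction would take a hypothetical baseline WER of 5% down to 3%.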


Aggressive Pricing and Broad Accessibility

Inworld's pricing undercuts the competition dramatically — often by 20–25x — making high-quality real-time TTS viable for large-scale deployments:

  • Mini: $0.005 per minute ($5 per million characters)
  • Max: $0.01 per minute ($10 per million characters)

This affordability gap has widened as some rivals have increased rates.
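
The per-minute and per-character prices listed above imply roughly 1,000 characters per minute of audio. A rough Python sketch of that arithmetic follows; the daily usage volume is a hypothetical workload, and the function names are illustrative rather than anything from Inworld's API.

```python
# Per-minute prices as quoted in the article (USD).
MINI_PER_MIN = 0.005
MAX_PER_MIN = 0.01


def implied_chars_per_minute(price_per_min, price_per_million_chars):
    """Characters per audio minute implied by the two quoted price points."""
    return price_per_min / price_per_million_chars * 1_000_000


def monthly_cost(minutes_per_day, price_per_min, days=30):
    """Estimated spend for a steady daily volume of generated audio."""
    return minutes_per_day * price_per_min * days


# Hypothetical workload: 10,000 minutes of generated speech per day.
chars_per_min = implied_chars_per_minute(MINI_PER_MIN, 5)
mini_month = monthly_cost(10_000, MINI_PER_MIN)
max_month = monthly_cost(10_000, MAX_PER_MIN)
print(f"~{chars_per_min:.0f} chars/min, "
      f"Mini ${mini_month:,.0f}/month, Max ${max_month:,.0f}/month")
```

Volume discounts and other plan details, which the article mentions but does not quantify, are left out of this estimate.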

TTS-1.5 supports 15 languages, with expanded coverage that now includes Hindi, so teams can build multilingual agents without prohibitive costs.

For enterprises requiring data sovereignty or custom scaling, Inworld offers full on-premise deployment on user infrastructure (e.g., H100/B200 clusters), alongside cloud API access with global availability, SLAs, volume pricing, and tailored architectures.


Competitive Landscape: A Serious Challenger Emerges

Inworld TTS-1.5 arrives at a pivotal moment in voice AI. ElevenLabs remains a leader in voice variety (hundreds of options across 29+ languages) and ultra-realistic cloning, but its higher pricing, often $0.18 or more per minute of audio, and its comparatively higher latency in some modes make it less suited to massive real-time consumer applications.

OpenAI's TTS provides strong integration within its ecosystem and good prosody control via prompting, yet it trails in blind leaderboard rankings for naturalness and expressiveness while carrying premium costs.

Independent comparisons (including Artificial Analysis arena data) show Inworld models consistently preferred by users for human-likeness, with TTS-1.5 widening the lead through combined speed, emotion, and stability.

Developers note that Inworld's value proposition — top-tier quality at a fraction of the cost — makes it especially attractive for building high-volume voice agents, live dubbing/translation, interactive entertainment, and accessibility tools like advanced screen readers.


Looking Ahead

Inworld positions TTS-1.5 as the foundation for its next wave of voice AI advancements, with ongoing integrations into platforms like LiveKit, Vapi, NLX, Pipecat, Ultravox, Voximplant, and Layercode. Quotes from partners underscore the excitement: Layercode's CEO called the realism "unmatched at a fraction of the cost," while NLX's co-founder highlighted Inworld's role in driving conversational voice as the future primary interface.

As real-time voice becomes central to AI companions, agents, and immersive experiences, Inworld's latest release raises the bar for what production-grade TTS can achieve — delivering Hollywood-level performance at startup-friendly prices and latencies that finally match human conversation rhythms. For developers racing to ship engaging, scalable voice products in 2026, this could be the tool that tips the scales.

