ElevenLabs Launches Scribe v2 Realtime: A Breakthrough in Ultra-Low Latency Speech-to-Text

In a move set to transform how AI interacts with human speech, ElevenLabs has unveiled Scribe v2 Realtime, a cutting-edge speech-to-text model optimized for real-time applications. Announced on January 6, 2026, this new iteration builds on the company's established Scribe lineup, emphasizing lightning-fast transcription with minimal latency while supporting over 90 languages. Designed primarily for conversational AI agents, live subtitling, translation, and other dynamic scenarios, Scribe v2 Realtime addresses the challenges of natural human speech, making it a game-changer for developers and enterprises alike.

ElevenLabs, known for its voice synthesis tools, has positioned Scribe v2 Realtime as a companion to its broader ecosystem, including the ElevenLabs Agents platform. The model is already available through the company's API, enabling seamless integration into voice-driven systems.

This release comes amid growing demand for accurate, instantaneous speech recognition, with the global speech-to-text market projected to reach $12.1 billion by 2028, driven by advancements in AI agents and virtual assistants.

Core Features and Technical Capabilities

At the heart of Scribe v2 Realtime is its ultra-low latency of approximately 150 milliseconds (down to 30-80 ms in optimized conditions), allowing for near-instantaneous transcription of live audio streams. This is achieved through a streaming-first architecture that processes audio in small chunks as they arrive instead of buffering whole utterances, ensuring fluid performance in high-stakes environments.

The model supports a wide array of audio formats, including PCM (from 8kHz to 48kHz) and μ-law encoding, making it compatible with telephony systems, web browsers, and professional recording setups.
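
To make the chunked streaming and the format support concrete, here is a minimal sketch of how a client might size and slice raw audio before sending it. The 100 ms chunk duration, the function names, and the choice of 16-bit mono PCM are illustrative assumptions, not values taken from ElevenLabs' documentation.

```python
from typing import Iterator


def chunk_size_bytes(sample_rate_hz: int, bytes_per_sample: int, chunk_ms: int) -> int:
    """Number of bytes of mono audio that cover chunk_ms milliseconds."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def iter_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


# 16 kHz, 16-bit PCM (2 bytes per sample), 100 ms per chunk (assumed values)
pcm_chunk = chunk_size_bytes(16_000, 2, 100)
# 8 kHz mu-law telephony audio uses 1 byte per sample
ulaw_chunk = chunk_size_bytes(8_000, 1, 100)

one_second_of_silence = bytes(16_000 * 2)
chunks = list(iter_chunks(one_second_of_silence, pcm_chunk))
print(pcm_chunk, ulaw_chunk, len(chunks))
```

Smaller chunks reduce the time audio waits on the client before it is sent, at the cost of more messages per second; the right trade-off depends on the network and the application.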

Key features include:

Predictive Transcription: The AI anticipates upcoming words and punctuation, enhancing accuracy during ongoing speech.
Voice Activity Detection (VAD): Automatically identifies the start and end of speech segments based on silence, reducing errors from background noise.
Manual Commit Control: Developers can manually finalize transcript segments, offering flexibility for custom applications.
Text Conditioning: Maintains context across interruptions or resets, ensuring coherent outputs in unstable connections.
Complex Vocabulary Handling: Excels at transcribing technical terms, proper nouns, medications, and domain-specific jargon.
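
The difference between VAD-driven and manual commits can be illustrated with a toy segmenter. This is a sketch of the general idea only: the energy threshold, the three-frame silence window, and the class and method names are all hypothetical, and ElevenLabs' actual VAD is not described in enough detail here to reproduce.

```python
import math
from typing import List, Optional, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class Segmenter:
    """Toy segmenter: commits after a run of silent frames, or on demand."""

    def __init__(self, threshold: float = 500.0, max_silent_frames: int = 3) -> None:
        self.threshold = threshold              # assumed energy cutoff for "speech"
        self.max_silent_frames = max_silent_frames
        self._speech_frames = 0                 # voiced frames in the open segment
        self._silent_run = 0                    # consecutive silent frames after speech
        self.committed: List[int] = []          # voiced-frame count per committed segment

    def push(self, frame: Sequence[int]) -> bool:
        """Feed one frame; return True if this frame triggered a VAD commit."""
        if rms(frame) >= self.threshold:
            self._speech_frames += 1
            self._silent_run = 0
            return False
        if self._speech_frames == 0:
            return False  # leading silence: nothing to commit yet
        self._silent_run += 1
        if self._silent_run >= self.max_silent_frames:
            self.commit()
            return True
        return False

    def commit(self) -> Optional[int]:
        """Manual commit: close the open segment now, if it contains speech."""
        if self._speech_frames == 0:
            return None
        count = self._speech_frames
        self.committed.append(count)
        self._speech_frames = 0
        self._silent_run = 0
        return count


loud, quiet = [2000] * 160, [10] * 160
seg = Segmenter()
events = [seg.push(f) for f in [quiet, loud, loud, quiet, quiet, quiet, loud]]
manual = seg.commit()  # finalize the trailing speech without waiting for silence
print(events, seg.committed, manual)
```

In this sketch the first segment is closed automatically once three quiet frames follow speech, while the final segment is closed by an explicit call, which is the kind of control a push-to-talk interface might want.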

With support for over 90 languages, including major ones like English, French, German, Italian, Spanish, Portuguese, Hindi, Japanese, Mandarin, Vietnamese, Polish, and Swedish, the model accommodates diverse accents, dialects, and acoustic conditions. Notably, it does not include speaker diarization or dual-channel support, focusing instead on core real-time transcription strengths.

Advancements Over Previous Models

Compared to its predecessor, Scribe v1, and even the batch-oriented Scribe v2 (released shortly after for long-form audio), Scribe v2 Realtime represents a significant leap in handling the nuances of human conversation. It better manages pauses, breaths, filler words (like "um" or "ah"), tone variations, and environmental noises — elements that often trip up older models. Trained on vast, diverse datasets, it achieves state-of-the-art accuracy, with internal benchmarks showing superior performance in challenging English conversations featuring poor audio quality and accents.

On the FLEURS multilingual benchmark, which evaluates accuracy across 30 languages, Scribe v2 Realtime boasts a 93.5% accuracy rate and the lowest Word Error Rate (WER) among low-latency ASR models. Independent comparisons highlight its edge over competitors: it outperforms Google's Gemini Flash 2.5 (90% accuracy), OpenAI's GPT-4o Mini (85%), and Deepgram's Nova 3 (80%) in real-time scenarios, setting a new industry benchmark.
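
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below computes it with a standard dynamic-programming edit distance. Treating "accuracy" as 1 minus WER is an assumption made here for illustration; the benchmark figures above do not state how accuracy was derived, and the function name and sample sentences are invented.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the quick brown fox", "the quick brown box")
accuracy = 1 - wer  # assumed relationship, see the note above
print(wer, accuracy)
```

One substituted word out of four gives a WER of 0.25 in this example. Real evaluations also normalize punctuation and casing before scoring, which this sketch only partly does.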

Versatile Applications with a Focus on AI Agents

While versatile enough for subtitling, live translation, and captioning, ElevenLabs emphasizes Scribe v2 Realtime's role in powering AI agents. Integrated into the ElevenLabs Agents platform (as an optional upgrade from the default model), it enables natural, responsive voice interactions: think virtual assistants that "listen" and reply in real time, capturing intents accurately even in noisy settings like call centers or meetings.

Other applications include:

Meeting Assistants: Real-time transcription for virtual conferences, with automatic detection of key terms.
Live Call Centers: Accurate capture of details like emails, phone numbers, and addresses during conversations.
Voice-Enabled Apps: From educational tools to gaming, where instant speech-to-text enhances user engagement.
Accessibility Features: Providing live captions for videos or events, supporting multilingual audiences.

The model's enterprise-grade security features, including SOC 2, ISO 27001, PCI DSS Level 1, HIPAA, and GDPR compliance, along with EU data residency and zero-retention modes, make it suitable for regulated industries like healthcare and finance.

Seamless API Integration for Developers

Developers can access Scribe v2 Realtime via ElevenLabs' API, which supports WebSocket for streaming audio and receiving transcripts in real time. The integration is straightforward: authenticate with an API key, send audio chunks, and receive partial or final transcripts, with options for VAD and commit controls. Code examples in Python, JavaScript, and other languages are available in the documentation, facilitating quick setup for custom agents or apps.
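
The send-chunks, receive-partials-and-finals flow described above can be sketched without a network connection. The code below only shows the client-side bookkeeping: packaging a chunk as a JSON text frame and merging incoming transcript events. Every message and field name in it ("audio_chunk", "partial_transcript", "final_transcript", "commit", "text") is a placeholder invented for this example, not the ElevenLabs wire format; the real schema, endpoint, and authentication details should be taken from the official documentation.

```python
import base64
import json
from typing import Any, Dict, List


def build_audio_message(chunk: bytes, commit: bool = False) -> str:
    """Wrap one raw audio chunk in a JSON text frame (hypothetical schema)."""
    return json.dumps({
        "type": "audio_chunk",  # placeholder event name
        "audio": base64.b64encode(chunk).decode("ascii"),
        "commit": commit,       # placeholder manual-commit flag
    })


class TranscriptBuffer:
    """Keeps finalized text plus the latest, still-changing partial."""

    def __init__(self) -> None:
        self.final_segments: List[str] = []
        self.partial: str = ""

    def handle(self, raw_message: str) -> None:
        """Apply one incoming event to the buffer."""
        event: Dict[str, Any] = json.loads(raw_message)
        kind = event.get("type")
        if kind == "partial_transcript":   # placeholder event name
            self.partial = event.get("text", "")
        elif kind == "final_transcript":   # placeholder event name
            self.final_segments.append(event.get("text", ""))
            self.partial = ""
        # Unknown event types are ignored in this sketch.

    def text(self) -> str:
        """Finalized segments followed by the current partial, if any."""
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)


outgoing = build_audio_message(b"\x00\x01" * 4, commit=True)
buf = TranscriptBuffer()
for msg in (
    '{"type": "partial_transcript", "text": "my email is"}',
    '{"type": "final_transcript", "text": "My email is ana@example.com."}',
    '{"type": "partial_transcript", "text": "and my"}',
):
    buf.handle(msg)
print(buf.text())
```

Keeping partials separate from finalized segments lets a user interface show provisional text immediately and replace it once the segment is committed.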

Pricing starts at $0.28 per hour of audio processed, with discounts for annual Business plans and scalable options for high-volume users. Enterprise clients benefit from higher concurrency limits (up to 30+ simultaneous streams) and dedicated support.
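
As a rough budgeting aid, the quoted starting rate can be turned into a cost estimate. This sketch assumes simple linear billing per second of audio at $0.28 per hour, with no plan discounts, minimums, or rounding rules, none of which are specified here; the function name is invented.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_HOUR = Decimal("0.28")  # starting price quoted in the article, in USD


def estimate_cost(audio_seconds: int, rate_per_hour: Decimal = RATE_PER_HOUR) -> Decimal:
    """Estimated cost in USD, rounded to the nearest cent (linear billing assumed)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    cost = rate_per_hour * Decimal(audio_seconds) / Decimal(3600)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


ninety_minutes = estimate_cost(90 * 60)
thousand_hours = estimate_cost(1000 * 3600)
print(ninety_minutes, thousand_hours)
```

Under these assumptions a 90-minute meeting comes to $0.42 and 1,000 hours of audio to $280.00.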

The web app at elevenlabs.io/app/speech-to-text offers a user-friendly interface for testing, where users can upload audio, select languages, and generate transcripts or subtitles instantly. It includes features like entity detection (for PII, health data, etc.) and automatic multi-language switching, though advanced options like speaker diarization are reserved for the batch Scribe v2 model.

Looking Ahead: The Future of Conversational AI

Scribe v2 Realtime's launch underscores ElevenLabs' commitment to advancing voice AI, following its recent expansions into multilingual dubbing and agentic tools. As AI agents become ubiquitous (they are projected to handle 30% of customer interactions by 2028), this model could accelerate adoption in sectors from e-commerce to education.

Early feedback on platforms like Reddit and LinkedIn praises its accuracy and speed, with developers noting easy integration and superior handling of real-world speech imperfections. However, as with any AI tool, users should verify outputs in critical applications. For those building the next generation of voice tech, Scribe v2 Realtime is now live, ready to turn spoken words into actionable text at the speed of thought.