ElevenLabs Launches Scribe v2: Industry's Most Accurate Speech-to-Text Model

By GenMediaLab • January 20, 2026 • 5 min read

Key Takeaways

✓ Scribe v2 Realtime delivers under 150ms latency for live transcription, and as low as 30-80ms in optimized conditions
✓ Supports 90+ languages with automatic language detection and predictive transcription
✓ Batch version includes keyterm prompting for up to 100 technical terms and entity detection for 56 data categories
✓ Speaker diarization supports up to 48 distinct speakers with timestamps
✓ 93.5% accuracy on multilingual benchmarks, which ElevenLabs says outperforms Whisper and Gemini Flash

What Happened

ElevenLabs has released Scribe v2, a new generation of speech-to-text models that the company claims is the most accurate transcription system available. The release consists of two specialized versions:

Scribe v2 Realtime (January 6, 2026) - Optimized for live conversational AI and voice agents
Scribe v2 Batch (January 9, 2026) - Designed for processing long-form audio, subtitling, and captioning at scale

This release positions ElevenLabs to compete directly with OpenAI’s Whisper, Google’s speech recognition, and enterprise transcription services like Rev and Otter.ai.

Scribe v2 Realtime: Built for Conversational AI

The Realtime version is purpose-built for live applications where latency matters - voice assistants, real-time captioning, and conversational AI agents.

Key Capabilities

Feature	Specification
Latency	Under 150ms typical, 30-80ms optimized
Languages	90+ with automatic detection
Accuracy	93.5% on multilingual benchmarks
Voice Activity Detection	Built-in VAD

How It Works

Scribe v2 Realtime uses predictive transcription - the model anticipates upcoming words and punctuation based on context, reducing perceived latency. Unlike traditional ASR systems that wait for complete utterances, Scribe v2 streams partial results as the speaker talks.

The system automatically detects which language is being spoken, handles code-switching between languages, and adapts to accents and background noise without manual configuration.
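To make the streaming behavior concrete, here is a minimal sketch of how a client might keep fast-changing partial results separate from finalized text. The event shape (`"type"` of `"partial"` or `"final"`, plus `"text"`) and the `LiveTranscript` class are assumptions made for illustration, not the ElevenLabs API.

```python
from dataclasses import dataclass, field


@dataclass
class LiveTranscript:
    """Keeps finalized text separate from the latest, still-changing partial."""

    committed: list[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, event: dict) -> None:
        # Hypothetical event shape: {"type": "partial" | "final", "text": str}
        if event["type"] == "partial":
            # A new partial replaces the previous one; it is never appended.
            self.partial = event["text"]
        elif event["type"] == "final":
            # A final result is locked in and clears the pending partial.
            self.committed.append(event["text"])
            self.partial = ""

    def render(self) -> str:
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)


events = [
    {"type": "partial", "text": "hello"},
    {"type": "partial", "text": "hello wor"},
    {"type": "final", "text": "Hello world."},
    {"type": "partial", "text": "how are"},
]

transcript = LiveTranscript()
for event in events:
    transcript.apply(event)

print(transcript.render())  # Hello world. how are
```

Replacing rather than appending partials is what lets a display show early guesses and then correct them once the final text arrives.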

Performance vs. Competitors

According to ElevenLabs’ benchmarks, Scribe v2 Realtime outperforms:

OpenAI Whisper - Higher accuracy in noisy conditions
Google Gemini Flash - Lower latency with comparable accuracy
Amazon Transcribe - Better handling of accents and dialects

Scribe v2 Batch: Enterprise-Grade Transcription

The Batch version targets different use cases - long podcast episodes, meeting recordings, video subtitles, and legal/medical transcription where accuracy and detail matter more than speed.

Keyterm Prompting

Users can input up to 100 technical terms (brand names, product names, jargon) to ensure context-aware accuracy. This is particularly valuable for:

Medical transcription (drug names, procedures)
Legal depositions (case names, legal terminology)
Technical content (product names, API terms)
Branded content (company names, trademarks)
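As a rough illustration of preparing a keyterm list before submitting it, the sketch below trims whitespace, drops case-insensitive duplicates, and enforces the 100-term limit mentioned above. The function name `prepare_keyterms` is hypothetical, and how the terms are actually passed to the service is not shown here.

```python
MAX_KEYTERMS = 100  # limit stated in the article


def prepare_keyterms(terms, limit=MAX_KEYTERMS):
    """Clean a list of key terms and check it against the term limit."""
    seen = set()
    cleaned = []
    for term in terms:
        # Collapse internal runs of whitespace and strip the ends.
        normalized = " ".join(term.split())
        key = normalized.casefold()
        if not normalized or key in seen:
            continue  # skip empty entries and case-insensitive duplicates
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} key terms given; the limit is {limit}")
    return cleaned


print(prepare_keyterms([" Ozempic ", "ozempic", "", "semaglutide"]))
# ['Ozempic', 'semaglutide']
```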

Entity Detection

Scribe v2 Batch automatically identifies and timestamps 56 categories of sensitive data, including:

Health information (HIPAA-relevant data)
Payment details (credit card numbers, bank accounts)
Personally identifiable information (Social Security numbers, addresses, phone numbers)
Credentials (passwords, API keys mentioned in recordings)

This feature is designed for compliance workflows where organizations need to redact sensitive information before sharing transcripts.
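A redaction step built on detected entities could look roughly like the sketch below. It assumes each entity arrives as a dictionary with a `category` and non-overlapping `start`/`end` character offsets into the transcript; that shape and the category names are illustrative assumptions, not the documented output format.

```python
def redact(text, entities, categories):
    """Replace entity spans in the chosen categories with a placeholder tag."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["category"] in categories:
            tag = f"[{ent['category'].upper()}]"
            text = text[: ent["start"]] + tag + text[ent["end"] :]
    return text


text = "Card 4111 1111 1111 1111, phone 555-0100."
card = "4111 1111 1111 1111"
phone = "555-0100"

# Hypothetical entity records; offsets are computed here rather than hard-coded.
entities = [
    {"category": "credit_card", "start": text.index(card), "end": text.index(card) + len(card)},
    {"category": "phone_number", "start": text.index(phone), "end": text.index(phone) + len(phone)},
]

redacted = redact(text, entities, {"credit_card"})
print(redacted)  # Card [CREDIT_CARD], phone 555-0100.
```

Passing the set of categories explicitly lets one transcript be redacted differently for different audiences.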

Speaker Diarization

The model supports labeling for up to 48 distinct speakers and includes audio-tagging for non-speech events like laughter, applause, and music. Each speaker segment includes precise timestamps.
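For readers who want to turn speaker-labeled output into a readable transcript, here is a small sketch that merges consecutive words from the same speaker into turns. The word records (`speaker`, `start`, `end`, `text`) are a hypothetical shape chosen for the example, not the service's actual response schema.

```python
def to_turns(words):
    """Merge consecutive words by the same speaker into one timed turn."""
    turns = []
    for word in words:
        if turns and turns[-1]["speaker"] == word["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            # Speaker changed: start a new turn.
            turns.append(
                {
                    "speaker": word["speaker"],
                    "start": word["start"],
                    "end": word["end"],
                    "text": word["text"],
                }
            )
    return turns


words = [
    {"speaker": "speaker_0", "start": 0.0, "end": 0.4, "text": "Good"},
    {"speaker": "speaker_0", "start": 0.4, "end": 0.9, "text": "morning."},
    {"speaker": "speaker_1", "start": 1.2, "end": 1.5, "text": "Hi"},
    {"speaker": "speaker_1", "start": 1.5, "end": 1.9, "text": "there."},
    {"speaker": "speaker_0", "start": 2.3, "end": 2.8, "text": "Shall"},
]

turns = to_turns(words)
for turn in turns:
    print(f"{turn['speaker']} [{turn['start']:.1f}-{turn['end']:.1f}]: {turn['text']}")
```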

Why This Matters

For Content Creators

Transcription is a foundational workflow for podcasters, YouTubers, and video producers. Accurate, automated transcription enables:

Searchable content archives - Find any moment by searching the transcript
Accessibility - Generate captions and subtitles automatically
Repurposing - Convert audio content to blog posts, social clips, newsletters
SEO - Search engines index transcript content
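As one example of the captioning workflow, timestamped segments can be converted to the widely used SRT subtitle format with a few lines of code. The segment shape (`start` and `end` in seconds, plus `text`) is assumed for illustration and would need to be mapped from whatever the transcription output actually provides.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg['text']}")
    return "\n\n".join(cues) + "\n"


segments = [
    {"start": 0.0, "end": 1.5, "text": "Welcome back to the show."},
    {"start": 1.5, "end": 4.25, "text": "Today we're talking about transcription."},
]

print(to_srt(segments))
```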

For Voice AI Developers

The Realtime model is designed to power the next generation of voice assistants and agents. With sub-150ms latency, developers can build conversational experiences that feel genuinely responsive rather than sluggish.

For Enterprise

The combination of entity detection, speaker diarization, and keyterm prompting addresses real compliance and workflow needs:

Legal - Accurate deposition transcripts with speaker identification
Healthcare - HIPAA-ready transcription with automatic PII detection
Finance - Meeting minutes with automatic redaction of sensitive numbers

How to Access Scribe v2

Both models are available through:

ElevenLabs API - For developers integrating transcription into applications
ElevenLabs Studio - Web interface for manual transcription tasks
ElevenLabs Agents - Integrated into the conversational AI platform

Pricing

Scribe v2 follows ElevenLabs’ tiered subscription model with specific monthly quotas for both batch and real-time transcription hours. Enterprise customers can negotiate custom pricing for high-volume needs.

Security and Compliance

ElevenLabs emphasizes enterprise-grade security:

SOC 2 Type II compliance
HIPAA readiness for healthcare applications
Zero Retention modes for sensitive workloads (audio deleted after processing)

The Bigger Picture

ElevenLabs has rapidly expanded from a text-to-speech startup to a full voice AI platform. Scribe v2 completes the audio loop - users can now:

Generate speech with text-to-speech and voice cloning
Transcribe speech back to text with Scribe v2
Build agents that combine both in real-time conversations

This positions ElevenLabs as a one-stop platform for voice AI, competing with larger players like Google, Amazon, and Microsoft, which offer similar capabilities across fragmented products.

FAQ

How does Scribe v2 compare to OpenAI Whisper?

ElevenLabs claims Scribe v2 achieves 93.5% accuracy on multilingual benchmarks, outperforming Whisper particularly in noisy conditions and with accented speech. The Realtime version also offers significantly lower latency than Whisper's batch-oriented architecture.

What languages does Scribe v2 support?

Scribe v2 supports over 90 languages with automatic language detection. The model can handle code-switching between languages within the same audio without manual configuration.

Is Scribe v2 HIPAA compliant?

ElevenLabs describes the service as HIPAA-ready: it offers deployment options for healthcare applications, including Zero Retention modes where audio is deleted after processing.

What is keyterm prompting?

Keyterm prompting allows you to provide up to 100 specific terms (brand names, technical jargon, proper nouns) that the model should recognize accurately. This improves accuracy for domain-specific content.

How many speakers can Scribe v2 distinguish?

The Batch version supports speaker diarization for up to 48 distinct speakers, with timestamps for each speaker segment and automatic labeling of non-speech events.

What is the latency for real-time transcription?

Scribe v2 Realtime typically achieves under 150ms latency, with optimized configurations reaching 30-80ms. This is fast enough for live conversational AI applications.

Affiliate Disclosure: This review contains affiliate links. If you purchase through our links, we may earn a commission at no additional cost to you. We only recommend tools we've personally tested and believe provide genuine value to our readers.