ElevenLabs Launches Scribe v2: Industry's Most Accurate Speech-to-Text Model
Key Takeaways
- ✓ Scribe v2 Realtime delivers under 150ms latency for live transcription, and as low as 30-80ms in optimized conditions
- ✓ Supports 90+ languages with automatic language detection and predictive transcription
- ✓ Batch version includes keyterm prompting for up to 100 technical terms and entity detection for 56 data categories
- ✓ Speaker diarization supports up to 48 distinct speakers with timestamps
- ✓ 93.5% accuracy on multilingual benchmarks, which ElevenLabs says outperforms Whisper and Gemini Flash
What Happened
ElevenLabs has released Scribe v2, a new generation of speech-to-text models that the company claims is the most accurate transcription system available. The release consists of two specialized versions:
- Scribe v2 Realtime (January 6, 2026) - Optimized for live conversational AI and voice agents
- Scribe v2 Batch (January 9, 2026) - Designed for processing long-form audio, subtitling, and captioning at scale
This release positions ElevenLabs to compete directly with OpenAI’s Whisper, Google’s speech recognition, and enterprise transcription services like Rev and Otter.ai.
Scribe v2 Realtime: Built for Conversational AI
The Realtime version is purpose-built for live applications where latency matters - voice assistants, real-time captioning, and conversational AI agents.
Key Capabilities
| Feature | Specification |
|---|---|
| Latency | Under 150ms typical, 30-80ms optimized |
| Languages | 90+ with automatic detection |
| Accuracy | 93.5% on multilingual benchmarks |
| Voice Activity Detection | Built-in VAD |
How It Works
Scribe v2 Realtime uses predictive transcription - the model anticipates upcoming words and punctuation based on context, reducing perceived latency. Unlike traditional ASR systems that wait for complete utterances, Scribe v2 streams partial results as the speaker talks.
The system automatically detects which language is being spoken, handles code-switching between languages, and adapts to accents and background noise without manual configuration.
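To make the difference between streamed partial results and completed utterances concrete, here is a minimal client-side sketch. It assumes a hypothetical event shape (`{"type": "partial" | "final", "text": ...}`); the real Scribe v2 Realtime message format is not described in this article and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class LiveTranscript:
    """Keeps committed text plus the latest, still-revisable partial result."""

    committed: list[str] = field(default_factory=list)
    partial: str = ""

    def apply(self, event: dict) -> None:
        # Hypothetical event shape, assumed for illustration only:
        # {"type": "partial" | "final", "text": str}
        if event["type"] == "partial":
            # A partial replaces the previous one, because the model may
            # revise its guess as more audio arrives.
            self.partial = event["text"]
        elif event["type"] == "final":
            # A final result is committed and the partial buffer is cleared.
            self.committed.append(event["text"])
            self.partial = ""
        else:
            raise ValueError(f"unknown event type: {event['type']!r}")

    def render(self) -> str:
        """Return what a live caption display would show right now."""
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A caption UI built this way would call `apply` for every incoming event and redraw from `render`, so that revised partials overwrite earlier guesses instead of piling up.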
Performance vs. Competitors
According to ElevenLabs’ benchmarks, Scribe v2 Realtime outperforms:
- OpenAI Whisper - Higher accuracy in noisy conditions
- Google Gemini Flash - Lower latency with comparable accuracy
- Amazon Transcribe - Better handling of accents and dialects
Scribe v2 Batch: Enterprise-Grade Transcription
The Batch version targets different use cases - long podcast episodes, meeting recordings, video subtitles, and legal/medical transcription where accuracy and detail matter more than speed.
Keyterm Prompting
Users can input up to 100 technical terms (brand names, product names, jargon) to ensure context-aware accuracy. This is particularly valuable for:
- Medical transcription (drug names, procedures)
- Legal depositions (case names, legal terminology)
- Technical content (product names, API terms)
- Branded content (company names, trademarks)
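Since the keyterm list is capped at 100 entries, it can help to clean and check it before submitting a job. The sketch below is an illustrative pre-flight helper only; `prepare_keyterms` is a hypothetical name, and how the list is actually passed to the API is not covered in this article.

```python
MAX_KEYTERMS = 100  # limit stated in the article


def prepare_keyterms(terms):
    """Trim, de-duplicate (case-insensitively), and size-check a keyterm list."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse stray whitespace
        if not normalized:
            continue  # skip empty entries
        key = normalized.casefold()
        if key in seen:
            continue  # keep only the first spelling of a duplicate
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(
            f"{len(cleaned)} keyterms supplied, but at most {MAX_KEYTERMS} are allowed"
        )
    return cleaned
```

Raising an error rather than silently truncating is a deliberate choice here: dropping terms unnoticed could remove exactly the drug or product name that needed protecting.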
Entity Detection
Scribe v2 Batch automatically identifies and timestamps 56 categories of sensitive data, including:
- Health information (HIPAA-relevant data)
- Payment details (credit card numbers, bank accounts)
- Personally identifiable information (SSN, addresses, phone numbers)
- Credentials (passwords, API keys mentioned in recordings)
This feature is designed for compliance workflows where organizations need to redact sensitive information before sharing transcripts.
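A redaction step built on detected entities might look like the following sketch. It assumes each entity arrives as a dict with character offsets (`start`, `end`, end-exclusive) and a `category` label; that shape is an assumption for illustration, since the article only says entities are identified and timestamped.

```python
def redact(text, entities):
    """Replace each detected entity span with a bracketed category label.

    `entities` is a list of dicts with hypothetical keys:
    "start" and "end" (character offsets, end-exclusive) and "category".
    """
    pieces = []
    cursor = 0
    for entity in sorted(entities, key=lambda e: e["start"]):
        start, end = entity["start"], entity["end"]
        if start < cursor:
            # Overlaps a span that was already redacted: extend the
            # redacted region without emitting a second label.
            cursor = max(cursor, end)
            continue
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"[{entity['category'].upper()}]")
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last entity
    return "".join(pieces)
```

The same approach would work with time-based spans by muting or bleeping the matching audio ranges instead of replacing characters.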
Speaker Diarization
The model supports labeling for up to 48 distinct speakers and includes audio-tagging for non-speech events like laughter, applause, and music. Each speaker segment includes precise timestamps.
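Downstream code typically turns word-level diarization output into per-speaker segments. The sketch below assumes a hypothetical word format with `text`, `start`, `end` (seconds), and `speaker` keys, sorted by start time; the actual response fields are not specified in this article.

```python
def group_by_speaker(words):
    """Merge consecutive words from the same speaker into timestamped segments.

    `words` is a list of dicts with hypothetical keys:
    "text", "start", "end" (seconds), and "speaker", sorted by "start".
    """
    segments = []
    for word in words:
        if segments and segments[-1]["speaker"] == word["speaker"]:
            # Same speaker keeps talking: extend the current segment.
            current = segments[-1]
            current["text"] += " " + word["text"]
            current["end"] = word["end"]
        else:
            # Speaker changed (or first word): start a new segment.
            segments.append(
                {
                    "speaker": word["speaker"],
                    "start": word["start"],
                    "end": word["end"],
                    "text": word["text"],
                }
            )
    return segments
```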
Why This Matters
For Content Creators
Transcription is a foundational workflow for podcasters, YouTubers, and video producers. Accurate, automated transcription enables:
- Searchable content archives - Find any moment by searching the transcript
- Accessibility - Generate captions and subtitles automatically
- Repurposing - Convert audio content to blog posts, social clips, newsletters
- SEO - Search engines index transcript content
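As an illustration of the captioning workflow, timestamped transcript segments can be converted to the widely used SubRip (SRT) subtitle format in a few lines. The segment shape (`start`, `end` in seconds, plus `text`) is assumed for this sketch and is not taken from the Scribe v2 API.

```python
def to_srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render segments (dicts with "start", "end", "text") as SRT text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = to_srt_timestamp(segment["start"])
        end = to_srt_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text']}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```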
For Voice AI Developers
The Realtime model is designed to power the next generation of voice assistants and agents. With sub-150ms latency, developers can build conversational experiences that feel genuinely responsive rather than sluggish.
For Enterprise
The combination of entity detection, speaker diarization, and keyterm prompting addresses real compliance and workflow needs:
- Legal - Accurate deposition transcripts with speaker identification
- Healthcare - HIPAA-ready transcription with automatic detection of health information and PII
- Finance - Meeting minutes with automatic redaction of sensitive numbers
How to Access Scribe v2
Both models are available through:
- ElevenLabs API - For developers integrating transcription into applications
- ElevenLabs Studio - Web interface for manual transcription tasks
- ElevenLabs Agents - Integrated into the conversational AI platform
Pricing
Scribe v2 follows ElevenLabs’ tiered subscription model with specific monthly quotas for both batch and real-time transcription hours. Enterprise customers can negotiate custom pricing for high-volume needs.
Security and Compliance
ElevenLabs emphasizes enterprise-grade security:
- SOC 2 Type II compliance
- HIPAA readiness for healthcare applications
- Zero Retention modes for sensitive workloads (audio deleted after processing)
The Bigger Picture
ElevenLabs has rapidly expanded from a text-to-speech startup to a full voice AI platform. Scribe v2 completes the audio loop - users can now:
- Generate speech with text-to-speech and voice cloning
- Transcribe speech back to text with Scribe v2
- Build agents that combine both in real-time conversations
This positions ElevenLabs as a one-stop platform for voice AI, competing with larger players like Google, Amazon, and Microsoft, which offer similar capabilities across fragmented products.
FAQ
How does Scribe v2 compare to OpenAI Whisper?
ElevenLabs claims Scribe v2 achieves 93.5% accuracy on multilingual benchmarks, outperforming Whisper particularly in noisy conditions and with accented speech. The Realtime version also offers significantly lower latency than Whisper's batch-oriented architecture.
What languages does Scribe v2 support?
Scribe v2 supports over 90 languages with automatic language detection. The model can handle code-switching between languages within the same audio without manual configuration.
Is Scribe v2 HIPAA compliant?
Yes, ElevenLabs offers HIPAA-ready deployment options for healthcare applications, including Zero Retention modes where audio is deleted immediately after processing.
What is keyterm prompting?
Keyterm prompting allows you to provide up to 100 specific terms (brand names, technical jargon, proper nouns) that the model should recognize accurately. This improves accuracy for domain-specific content.
How many speakers can Scribe v2 distinguish?
The Batch version supports speaker diarization for up to 48 distinct speakers, with timestamps for each speaker segment and automatic labeling of non-speech events.
What is the latency for real-time transcription?
Scribe v2 Realtime typically achieves under 150ms latency, with optimized configurations reaching 30-80ms. This is fast enough for live conversational AI applications.