NVIDIA has released PersonaPlex-7B-v1, a 7 billion parameter speech-to-speech model that fundamentally changes how voice AI handles conversation. Unlike every voice assistant you have used before, PersonaPlex does not wait for you to finish talking before it starts responding. It listens and speaks at the same time.
This is called full-duplex interaction, and it is the same way humans naturally converse. You can interrupt it mid-sentence, and it adapts. It produces backchannels like “uh-huh” and “oh, okay” while you are still speaking. It pauses when appropriate. No rigid turn-taking. No awkward silence while the AI processes your words.
PersonaPlex-7B-v1 is released under the NVIDIA Open Model License (weights) and MIT License (code). Both permit commercial use. Download from Hugging Face or GitHub.
Traditional voice assistants run a three-stage pipeline that creates an unnatural conversation flow:
The cascaded pipeline behind Siri, Alexa, and Google Assistant
| Stage | Process | Problem |
|---|---|---|
| 1. ASR | Automatic Speech Recognition converts speech to text | Adds latency |
| 2. LLM | Language model generates a text response | Cannot hear you while thinking |
| 3. TTS | Text-to-Speech converts response to audio | More latency, no overlap |
Each stage adds delay, and the system cannot hear you while it is generating a response. This is why conversations with Siri, Alexa, or Google Assistant feel robotic. You speak, wait, get a response, speak again.
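To make the additive cost concrete, here is a toy latency model. The stage timings are illustrative assumptions for the sketch, not measurements from any real system:

```python
# Toy model of cascaded vs. full-duplex response latency.
# Stage timings below are illustrative assumptions, not measurements.

def cascaded_latency(asr=0.3, llm=0.6, tts=0.3):
    """In a cascaded pipeline the user hears nothing until every
    stage has finished, so the per-stage delays add up."""
    return asr + llm + tts

def full_duplex_latency(first_audio=0.2):
    """A single speech-to-speech model can start emitting audio as
    soon as its first output tokens are ready."""
    return first_audio

print(f"cascaded:    {cascaded_latency():.2f}s")
print(f"full-duplex: {full_duplex_latency():.2f}s")
```

The point of the sketch is structural: a cascade's floor is the sum of its stages, while a single-model design's floor is just its time to first audio.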
PersonaPlex replaces this entire pipeline with a single Transformer model that processes incoming audio and generates speech simultaneously.
- Listens and speaks simultaneously with natural interruptions, backchannels, and rapid turn-taking, with no waiting required
- Define any role through text prompts (personality, business rules) plus audio voice conditioning (accent, tone, prosody)
- Average response time of 0.205-0.265 seconds, 5.7x faster than Moshi, the model it builds on
- Handles scenarios outside its training data, like technical crisis management, thanks to the Helium language model backbone
- Produces pauses, emotional tones, stress, urgency, and contextual responses that mirror human conversation patterns
- NVIDIA Open Model License (weights) and MIT (code) allow full commercial deployment and modification
PersonaPlex is built on the Moshi architecture from Kyutai, with Helium as the underlying language model backbone. The architecture uses two parallel audio streams: one that continuously encodes the user's incoming speech, and one that generates the model's own speech output.
Both streams share the same model state. This means PersonaPlex can adjust its response in real time as the user speaks, enabling barge-in, overlapping speech, rapid turn-taking, and contextual backchannels.
The Mimi neural audio codec handles audio encoding and decoding at 24 kHz, converting waveforms into discrete tokens that the Transformer can process.
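Some back-of-the-envelope framing arithmetic helps show what "discrete tokens" means here. The 12.5 Hz frame rate and 8 codebooks per frame are figures from Kyutai's Moshi/Mimi work, not stated in this article, so treat them as assumptions:

```python
# Back-of-the-envelope token framing for the Mimi codec.
# Assumed figures (from Kyutai's Moshi/Mimi work, not this article):
# 12.5 frames per second, 8 codebooks per frame.
SAMPLE_RATE_HZ = 24_000   # stated in the article
FRAME_RATE_HZ = 12.5      # assumed
CODEBOOKS = 8             # assumed

samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS

print(samples_per_frame)   # raw audio samples compressed per frame
print(tokens_per_second)   # discrete tokens the Transformer sees per second
```

Under these assumptions, each frame compresses 1,920 raw samples into a handful of discrete tokens, which is what makes real-time Transformer processing of audio tractable.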
PersonaPlex uses two inputs to define conversational identity: a text prompt that specifies the role, background, and business rules, and an audio voice prompt that conditions vocal characteristics such as accent, tone, and prosody.
This hybrid approach lets you create a customer service agent for a specific company with a specific voice, a wise teacher who sounds warm and patient, or a fantasy character with dramatic inflection. The persona stays consistent throughout the entire conversation.
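As a sketch of what hybrid conditioning could look like in practice, here is a hypothetical persona definition. The field names and file path are invented for illustration and do not reflect the actual PersonaPlex API:

```python
# Hypothetical persona definition. The dictionary keys and the file
# path are invented for illustration; the real PersonaPlex interface
# may differ.
persona = {
    # Text prompt: role, background, and business rules
    "text_prompt": (
        "You work for First Neuron Bank and your name is Sanni Virtanen. "
        "Be warm and concise, and follow the bank's verification script "
        "before discussing any account details."
    ),
    # Voice prompt: a reference clip conditioning accent, tone, prosody
    "voice_prompt_wav": "voices/reference_speaker.wav",
}

print(persona["text_prompt"].split(".")[0])
```

The split of responsibilities is the key idea: the text prompt controls *what* the agent says and does, while the voice prompt controls *how* it sounds.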
PersonaPlex maintains persona consistency across extended conversations
The astronaut scenario, in which the model role-plays an astronaut managing a reactor emergency, is particularly notable. Emergency crisis management, reactor physics vocabulary, and emotional urgency were never in the training data; PersonaPlex generalized from its Helium language model backbone to handle an entirely new domain.
NVIDIA evaluated PersonaPlex on FullDuplexBench and a new extension called ServiceDuplexBench for customer service scenarios. The results show clear advantages over both open-source and commercial alternatives.
Success rate (higher is better)
| Metric | PersonaPlex | Moshi | Gemini Live | Qwen 2.5 Omni |
|---|---|---|---|---|
| Smooth Turn Taking | 90.8% | 1.8% | 43.9% | N/A |
| User Interruption | 95.0% | 65.3% | 54.7% | N/A |
| Pause Handling | 60.6% | 33.6% | 65.5% | N/A |
Response time in seconds (lower is better)
| Metric | PersonaPlex | Moshi | Gemini Live |
|---|---|---|---|
| Smooth Turn Taking | 0.170s | 0.953s | N/A |
| User Interruption | 0.240s | 1.409s | N/A |
| Average | 0.205s | 1.181s | N/A |
GPT-4o judge score out of 5 (higher is better)
| Benchmark | PersonaPlex | Moshi | Gemini Live | Qwen 2.5 Omni |
|---|---|---|---|---|
| FullDuplexBench | 4.29 | 0.77 | 3.38 | 4.59 |
| ServiceDuplexBench | 4.40 | 1.75 | 4.73 | 2.76 |
| Average | 4.34 | 1.26 | 4.05 | 3.68 |
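As a quick consistency check, the "Average" rows can be recomputed from the per-benchmark figures quoted in the tables above:

```python
# Sanity-check the "Average" rows against the per-benchmark scores
# quoted in the tables above: (FullDuplexBench, ServiceDuplexBench,
# reported average).
judge_scores = {
    "PersonaPlex": (4.29, 4.40, 4.34),
    "Moshi": (0.77, 1.75, 1.26),
    "Gemini Live": (3.38, 4.73, 4.05),
    "Qwen 2.5 Omni": (4.59, 2.76, 3.68),
}
for model, (full, service, reported_avg) in judge_scores.items():
    computed = (full + service) / 2
    # Allow for rounding to two decimals in the published table
    assert abs(computed - reported_avg) <= 0.0051, model

# Latency table: (0.170 + 0.240) / 2 should match the 0.205 s average
assert abs((0.170 + 0.240) / 2 - 0.205) < 1e-9
print("averages consistent")
```

All four judge-score averages and the latency average reproduce from the per-benchmark numbers to within two-decimal rounding.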
PersonaPlex is the only model that scores above 4.0 on both benchmarks, combining strong general knowledge with reliable task-following in structured business scenarios.
PersonaPlex was trained in a single stage using a carefully designed blend of real and synthetic conversations.
7,303 calls (1,217 hours) from the Fisher English corpus provided natural conversational patterns - backchannels, disfluencies, emotional responses, and authentic turn-taking behavior. These recordings were back-annotated with persona prompts using GPT-OSS-120B at varying levels of detail.
The training design disentangles two qualities: naturalness from real conversations and task adherence from synthetic scenarios. The hybrid prompt format bridges both data sources, letting the model combine natural speech patterns with precise instruction following.
PersonaPlex represents a significant shift in what open-source voice AI can do. Until now, the choice was between customizable but robotic cascaded systems and natural but inflexible full-duplex models. PersonaPlex eliminates that trade-off.
The model is ready for commercial use. Developers building voice agents, customer service bots, or interactive characters now have an open-source foundation that rivals proprietary systems. The MIT-licensed code and the commercially permissive weight license mean full freedom to modify and deploy.
Full-duplex interaction has been the holy grail of conversational AI. Google, OpenAI, and others have invested heavily in making voice assistants feel more natural. NVIDIA has now open-sourced a model that achieves this at the 7B parameter scale, lowering the barrier for anyone to build truly conversational voice interfaces.
Voice-first interfaces are accelerating across customer service, accessibility tools, gaming, and content creation. PersonaPlex’s persona control makes it practical for specific business use cases where the AI needs to sound on-brand and follow structured scripts while still feeling human.
PersonaPlex-7B-v1 is an impressive first release, but there are constraints to be aware of before deploying.
Everything you need to run PersonaPlex
Requires a Linux machine with an NVIDIA GPU (Ampere or Hopper) and Python installed.
1. Install the audio codec and clone the repo:

```shell
# Ubuntu/Debian
sudo apt install libopus-dev

# Clone and install
git clone https://github.com/NVIDIA/personaplex.git
cd personaplex
pip install moshi/.
```
2. Accept the model license on Hugging Face, then set your token:

```shell
export HF_TOKEN=your_token_here
```
3. Launch the server (auto-generates temporary SSL certs):

```shell
SSL_DIR=$(mktemp -d)
python -m moshi.server --ssl "$SSL_DIR"
```
4. Open https://localhost:8998 in your browser. Start talking — PersonaPlex responds in real time.
Add `--cpu-offload` to the server command to offload layers to CPU. Requires `pip install accelerate` first.
**What is PersonaPlex-7B-v1?** PersonaPlex-7B-v1 is a 7 billion parameter speech-to-speech AI model from NVIDIA that enables real-time, full-duplex voice conversations. It can listen and speak simultaneously, handle interruptions naturally, and maintain customizable personas through hybrid prompting.
**How is PersonaPlex different from traditional voice assistants?** Traditional voice assistants use a three-stage pipeline (speech recognition, language model, text-to-speech) that adds delay at every stage and cannot handle overlapping speech. PersonaPlex uses a single model that processes audio in real time, enabling natural conversation with sub-second latency of 0.205-0.265 seconds.
**Is PersonaPlex free for commercial use?** Yes. The model weights are released under the NVIDIA Open Model License and the code is MIT-licensed. Both permit commercial use. You can download everything from Hugging Face and GitHub at no cost.
**What hardware does PersonaPlex require?** PersonaPlex requires NVIDIA GPUs, specifically Ampere or Hopper architecture cards like the A100 or H100. It is not currently optimized for consumer GPUs or non-NVIDIA hardware.
**Does PersonaPlex support languages other than English?** Not yet. The current release is English-only. The training data is entirely in English, using the Fisher English corpus plus English synthetic conversations.
**How does persona customization work?** PersonaPlex uses hybrid prompting. A text prompt defines the role, background, and scenario (such as 'You work for First Neuron Bank and your name is Sanni Virtanen'). A voice prompt provides an audio embedding that controls vocal characteristics like accent, tone, and speaking style. Together, they create a consistent persona.