NVIDIA PersonaPlex-7B: Open Source Full-Duplex Voice AI

By GenMediaLab · 6 min read
[Image: Dual sound waves crossing in real time, representing NVIDIA PersonaPlex full-duplex voice AI]

Key Takeaways

  • NVIDIA releases PersonaPlex-7B-v1, a 7 billion parameter speech-to-speech model that listens and speaks at the same time
  • Full-duplex design eliminates the pause-talk-pause cycle of traditional voice assistants with sub-second latency (0.205-0.265s)
  • Hybrid prompting lets you define any persona through text descriptions plus audio-based voice conditioning
  • Outperforms Gemini Live, Qwen 2.5 Omni, and Moshi on conversational dynamics and task adherence benchmarks
  • 100% open source: model weights under NVIDIA Open Model License, code under MIT

What Happened

NVIDIA has released PersonaPlex-7B-v1, a 7 billion parameter speech-to-speech model that fundamentally changes how voice AI handles conversation. Unlike every voice assistant you have used before, PersonaPlex does not wait for you to finish talking before it starts responding. It listens and speaks at the same time.

This is called full-duplex interaction, and it is the same way humans naturally converse. You can interrupt it mid-sentence, and it adapts. It produces backchannels like “uh-huh” and “oh, okay” while you are still speaking. It pauses when appropriate. No rigid turn-taking. No awkward silence while the AI processes your words.

  • 🧠 7B parameters
  • 0.2s average latency
  • 📖 MIT code license
  • 📊 Under 5,000 hours of training data
Fully Open Source

PersonaPlex-7B-v1 is released under the NVIDIA Open Model License (weights) and MIT License (code). Both permit commercial use. Download from Hugging Face or GitHub.

Why Traditional Voice AI Falls Short

Traditional voice assistants run a three-stage pipeline that creates an unnatural conversation flow:

The cascaded pipeline behind Siri, Alexa, and Google Assistant

| Stage | Process | Problem |
|---|---|---|
| 1. ASR | Automatic Speech Recognition converts speech to text | Adds latency |
| 2. LLM | Language model generates a text response | Cannot hear you while thinking |
| 3. TTS | Text-to-Speech converts response to audio | More latency, no overlap |

Each stage adds delay, and the system cannot hear you while it is generating a response. This is why conversations with Siri, Alexa, or Google Assistant feel robotic. You speak, wait, get a response, speak again.

PersonaPlex replaces this entire pipeline with a single Transformer model that processes incoming audio and generates speech simultaneously.
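The latency argument can be made concrete with a back-of-the-envelope sketch. This is illustrative only, not NVIDIA's code, and the per-stage timings are made-up placeholders: the point is that a cascaded pipeline's delays add up, while a streaming model pays only its per-frame compute cost before the first audio comes back.

```python
# Illustrative sketch: why cascaded ASR -> LLM -> TTS delays accumulate.
# Stage timings are hypothetical placeholders, not measured values.

CASCADED_STAGES = {"asr": 0.30, "llm": 0.60, "tts": 0.25}  # seconds (hypothetical)

def cascaded_response_latency(stages: dict) -> float:
    """Stages run strictly in sequence, so the total delay is the sum."""
    return sum(stages.values())

def full_duplex_first_audio(frame_compute: float = 0.2) -> float:
    """A single streaming model can emit its first audio frame after only
    its per-frame compute time; there are no pipeline hand-offs."""
    return frame_compute

print(f"cascaded:    {cascaded_response_latency(CASCADED_STAGES):.2f}s")
print(f"full-duplex: {full_duplex_first_audio():.2f}s")
```

Even with generous per-stage numbers, the sequential hand-offs dominate, which is why cascaded assistants cannot get under the sub-second bar that PersonaPlex reports.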

Core Capabilities

🔄 Full-Duplex Conversation

Listens and speaks simultaneously with natural interruptions, backchannels, and rapid turn-taking - no waiting required.

🎭 Hybrid Persona Control

Define any role through text prompts (personality, business rules) plus audio voice conditioning (accent, tone, prosody).

Sub-Second Latency

Average response time of 0.205-0.265 seconds - roughly 5.7x faster than Moshi, the model it builds on.

🧠 Emergent Generalization

Handles scenarios outside its training data, like technical crisis management, thanks to the Helium language model backbone.

🎙️ Non-Verbal Cues

Produces pauses, emotional tones, stress, urgency, and contextual responses that mirror human conversation patterns.

🔓 Commercial-Ready Open Source

NVIDIA Open Model License (weights) and MIT (code) allow full commercial deployment and modification.

How PersonaPlex Works

Dual-Stream Architecture

PersonaPlex is built on the Moshi architecture from Kyutai, with Helium as the underlying language model backbone. The architecture uses two parallel streams:

  • User stream - continuously encodes incoming audio from the user’s microphone
  • Agent stream - simultaneously generates the AI’s speech and text response

Both streams share the same model state. This means PersonaPlex can adjust its response in real time as the user speaks, enabling barge-in, overlapping speech, rapid turn-taking, and contextual backchannels.

The Mimi neural audio codec handles audio encoding and decoding at 24 kHz, converting waveforms into discrete tokens that the Transformer can process.
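The dual-stream idea can be sketched as a per-frame loop: at every codec frame the model consumes one user audio token and emits one agent token, so it "hears" while it "speaks". The function names below (run_duplex, model_step) are invented for illustration; this is not the actual Moshi/PersonaPlex API.

```python
# Hypothetical sketch of a dual-stream loop. One user token in, one agent
# token out, every codec frame - with a single shared state, so the agent's
# output can react mid-utterance to what the user is saying.

def run_duplex(user_frames, model_step, state=None):
    agent_frames = []
    for user_tok in user_frames:                        # one token per codec frame
        agent_tok, state = model_step(user_tok, state)  # shared model state
        agent_frames.append(agent_tok)                  # emitted immediately, no turn gate
    return agent_frames

# Toy stand-in "model" that carries a running state across frames.
def toy_step(user_tok, state):
    state = (state or 0) + user_tok
    return state % 7, state

print(run_duplex([1, 2, 3], toy_step))  # [1, 3, 6]
```

The key structural point is that there is no "user turn" or "agent turn" anywhere in the loop: interruption handling and backchannels fall out of emitting a token every frame rather than being special-cased.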

Hybrid Persona Control

PersonaPlex uses two inputs to define conversational identity:

  • Text prompt - describes the role, background, organization, and conversation context (up to 200 tokens)
  • Voice prompt - an audio embedding that captures vocal characteristics, speaking style, accent, and prosody

This hybrid approach lets you create a customer service agent for a specific company with a specific voice, a wise teacher who sounds warm and patient, or a fantasy character with dramatic inflection. The persona stays consistent throughout the entire conversation.
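As a rough mental model, a hybrid persona is just a pair of inputs handed to the model together. The structure below is an assumption for illustration: the Persona class and the way the embedding is represented are not the real PersonaPlex API, though the bank-agent text mirrors the example NVIDIA demonstrates.

```python
# Illustrative only: what a hybrid persona conceptually bundles.
# The Persona class and the raw-bytes embedding are assumptions,
# not the actual PersonaPlex prompt format.

from dataclasses import dataclass

@dataclass
class Persona:
    text_prompt: str      # role, rules, context (capped at ~200 tokens)
    voice_prompt: bytes   # audio embedding controlling accent, tone, prosody

def make_bank_agent(voice_embedding: bytes) -> Persona:
    text = (
        "You work for First Neuron Bank and your name is Sanni Virtanen. "
        "Verify the caller's identity before discussing any transaction."
    )
    return Persona(text_prompt=text, voice_prompt=voice_embedding)

agent = make_bank_agent(b"\x00fake-embedding")
print(agent.text_prompt[:30])
```

Swapping either half independently - same script with a new voice, or same voice with new business rules - is what makes the approach practical for on-brand deployments.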

Demonstrated Personas

PersonaPlex maintains persona consistency across extended conversations

| Persona | Scenario | Key Behavior |
|---|---|---|
| Wise Teacher | General Q&A assistant | Natural turn-taking, broad knowledge |
| Bank Agent (Sanni Virtanen) | Flagged transaction investigation | Empathy, identity verification, accent control |
| Medical Receptionist | New patient registration | Records details from speech, maintains confidentiality |
| Astronaut (Alex) | Reactor core emergency on Mars mission | Stress, urgency, technical reasoning outside training data |

Beyond Training Data

The astronaut scenario is particularly notable. Emergency crisis management, reactor physics vocabulary, and emotional urgency were never in the training data. PersonaPlex generalized from its Helium language model backbone to handle entirely new domains.

Benchmark Results

NVIDIA evaluated PersonaPlex on FullDuplexBench and a new extension called ServiceDuplexBench for customer service scenarios. The results show clear advantages over both open-source and commercial alternatives.

Conversational Dynamics

Success rate (higher is better)

| Metric | PersonaPlex | Moshi | Gemini Live | Qwen 2.5 Omni |
|---|---|---|---|---|
| Smooth Turn Taking | 90.8% | 1.8% | 43.9% | N/A |
| User Interruption | 95.0% | 65.3% | 54.7% | N/A |
| Pause Handling | 60.6% | 33.6% | 65.5% | N/A |

Latency

Response time in seconds (lower is better)

| Metric | PersonaPlex | Moshi | Gemini Live |
|---|---|---|---|
| Smooth Turn Taking | 0.170s | 0.953s | N/A |
| User Interruption | 0.240s | 1.409s | N/A |
| Average | 0.205s | 1.181s | N/A |
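The averages and the "5.7x faster than Moshi" figure follow directly from the per-metric numbers; a quick check:

```python
# Recomputing the averages and speedup from the latency table above.

personaplex = {"smooth_turn_taking": 0.170, "user_interruption": 0.240}
moshi = {"smooth_turn_taking": 0.953, "user_interruption": 1.409}

pp_avg = sum(personaplex.values()) / len(personaplex)   # 0.205 s
moshi_avg = sum(moshi.values()) / len(moshi)            # 1.181 s
speedup = moshi_avg / pp_avg                            # ~5.76x, quoted as ~5.7x

print(f"PersonaPlex avg: {pp_avg:.3f}s")
print(f"Moshi avg:       {moshi_avg:.3f}s")
print(f"speedup:         {speedup:.2f}x")
```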

Task Adherence

GPT-4o judge score out of 5 (higher is better)

| Benchmark | PersonaPlex | Moshi | Gemini Live | Qwen 2.5 Omni |
|---|---|---|---|---|
| FullDuplexBench | 4.29 | 0.77 | 3.38 | 4.59 |
| ServiceDuplexBench | 4.40 | 1.75 | 4.73 | 2.76 |
| Average | 4.34 | 1.26 | 4.05 | 3.68 |

PersonaPlex is the only model that scores above 4.0 on both benchmarks, combining strong general knowledge with reliable task-following in structured business scenarios.

Training: Less Than 5,000 Hours

PersonaPlex was trained in a single stage using a carefully designed blend of real and synthetic conversations.

Real Conversations

7,303 calls (1,217 hours) from the Fisher English corpus provided natural conversational patterns - backchannels, disfluencies, emotional responses, and authentic turn-taking behavior. These recordings were back-annotated with persona prompts using GPT-OSS-120B at varying levels of detail.

Synthetic Conversations

  • 39,322 assistant dialogs (410 hours) - generated with Qwen3-32B and GPT-OSS-120B, synthesized to audio with Chatterbox TTS from Resemble AI
  • 105,410 customer service dialogs (1,840 hours) - covering various business scenarios with structured prompts including company names, pricing, and operational rules

The training design disentangles two qualities: naturalness from real conversations and task adherence from synthetic scenarios. The hybrid prompt format bridges both data sources, letting the model combine natural speech patterns with precise instruction following.
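Adding up the dataset sizes listed above confirms the "under 5,000 hours" budget:

```python
# Sanity check on the training-data total from the figures in this section.

hours = {
    "fisher_real_calls": 1217,           # 7,303 Fisher English calls
    "synthetic_assistant": 410,          # 39,322 assistant dialogs
    "synthetic_customer_service": 1840,  # 105,410 customer service dialogs
}
total = sum(hours.values())
print(f"total training audio: {total} hours")  # 3467 hours, well under 5,000
```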

What This Means for Voice AI

PersonaPlex represents a significant shift in what open-source voice AI can do. Until now, the choice was between customizable but robotic cascaded systems and natural but inflexible full-duplex models. PersonaPlex eliminates that trade-off.

For Developers

The model is ready for commercial use. Developers building voice agents, customer service bots, or interactive characters now have an open-source foundation that rivals proprietary systems. The MIT-licensed code means full freedom to modify and deploy.

For the Voice AI Industry

Full-duplex interaction has been the holy grail of conversational AI. Google, OpenAI, and others have invested heavily in making voice assistants feel more natural. NVIDIA has now open-sourced a model that achieves this at the 7B parameter scale, lowering the barrier for anyone to build truly conversational voice interfaces.

For Creators and Businesses

Voice-first interfaces are accelerating across customer service, accessibility tools, gaming, and content creation. PersonaPlex’s persona control makes it practical for specific business use cases where the AI needs to sound on-brand and follow structured scripts while still feeling human.


Current Limitations

Early Release Constraints

PersonaPlex-7B-v1 is an impressive first release, but there are constraints to be aware of before deploying.

  • English only - no multilingual support yet
  • Requires NVIDIA GPUs - optimized for Ampere and Hopper architectures (A100, H100)
  • Limited training data - under 5,000 hours, which may restrict performance in niche dialects or specialized domains
  • No production safety testing - NVIDIA notes that bias, explainability, and privacy concerns need additional testing before production deployment

How to Get Started

Everything you need to run PersonaPlex

| Resource | License |
|---|---|
| Model Weights | NVIDIA Open Model License (commercial use permitted) |
| Source Code | MIT License (no restrictions) |
| Research Paper | Open Access |
| Base Model (Moshi) | CC-BY-4.0 (share with attribution) |

Quick Start (5 minutes)

Requires a Linux machine with an NVIDIA GPU (Ampere or Hopper) and Python installed.

1. Install the audio codec and clone the repo:

```shell
# Ubuntu/Debian
sudo apt install libopus-dev

# Clone and install
git clone https://github.com/NVIDIA/personaplex.git
cd personaplex
pip install moshi/.
```

2. Accept the model license on Hugging Face, then set your token:

```shell
export HF_TOKEN=your_token_here
```

3. Launch the server (auto-generates temporary SSL certs):

```shell
SSL_DIR=$(mktemp -d); python -m moshi.server --ssl "$SSL_DIR"
```

4. Open https://localhost:8998 in your browser. Start talking — PersonaPlex responds in real time.

Low GPU Memory?

Add `--cpu-offload` to the server command to offload layers to CPU. Requires `pip install accelerate` first.

FAQ

What is NVIDIA PersonaPlex-7B?

PersonaPlex-7B-v1 is a 7 billion parameter speech-to-speech AI model from NVIDIA that enables real-time, full-duplex voice conversations. It can listen and speak simultaneously, handle interruptions naturally, and maintain customizable personas through hybrid prompting.

How is PersonaPlex different from regular voice assistants?

Traditional voice assistants use a three-stage pipeline (speech recognition, language model, text-to-speech) that creates delays and cannot handle overlapping speech. PersonaPlex uses a single model that processes audio in real time, enabling natural conversation with sub-second latency of 0.205-0.265 seconds.

Is PersonaPlex free to use?

Yes. The model weights are released under the NVIDIA Open Model License and the code is MIT-licensed. Both permit commercial use. You can download everything from Hugging Face and GitHub at no cost.

What hardware do I need to run PersonaPlex?

PersonaPlex requires NVIDIA GPUs, specifically Ampere or Hopper architecture cards like the A100 or H100. It is not currently optimized for consumer GPUs or non-NVIDIA hardware.

Does PersonaPlex support languages other than English?

Not yet. The current release is English-only. The training data is entirely in English, using the Fisher English corpus plus English synthetic conversations.

How does persona control work in PersonaPlex?

PersonaPlex uses hybrid prompting. A text prompt defines the role, background, and scenario (such as 'You work for First Neuron Bank and your name is Sanni Virtanen'). A voice prompt provides an audio embedding that controls vocal characteristics like accent, tone, and speaking style. Together, they create a consistent persona.


Sources

  1. NVIDIA ADLR - PersonaPlex: Natural Conversational AI With Any Role and Voice
  2. MarkTechPost - NVIDIA Releases PersonaPlex-7B-v1
  3. NVIDIA PersonaPlex-7B-v1 on Hugging Face
  4. PersonaPlex GitHub Repository
