Thread for understanding customisation for end-to-end use cases
Thanks to the NVIDIA team for releasing the model. Lately there has been a big spike of activity in the Speech AI domain, and I had a couple of questions in mind. The model is exactly what the industry needs: it supports a full-duplex architecture with customisation for role-taking. But:
how can I connect it with some kind of MCP server? (Let's say I want a user to query xyz details related to them, and I have built an MCP server for textual answering; how can I plug and play it within the same architecture? Is that possible or not?)
how can I use it for my own custom language (Hindi, Arabic, Tamil, etc.)?
Lately NVIDIA has been releasing multiple speech models; is there any plan to let us patch in an encoder and decoder of our own choice, something like what Hugging Face allows with AutoModelForCausalLM-style modules? That way more engineers and devs would have free time for rapid prototyping before going down the harder route of fine-tuning.
This architecture is quite monolithic, so it's hard to modularize with swappable encoders and decoders. It's also English-only at this point.
Excellent point about the MCP server. We don't have tool-calling support like that right now, but we'll try to add support for your scenario in future models.
A workaround that might be possible: prompt the model to say something like "I am looking it up" when it needs external information. Detect that phrase being emitted in the text channel, feed the conversation transcript (using a lightweight ASR model like parakeet in parallel) to a query-summarizer prompt, and make the MCP query. Then, when the results come back, start a new context of this model with the previous conversation summary and the MCP results in the text prompt. You could play some sort of "on hold" tone/music in the meantime. This is a very experimental suggestion.
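In glue code, that loop could look roughly like the sketch below. Every helper here (`transcribe`, `summarizeQuery`, `queryMcp`, `restartWithContext`, the hold-music functions) is an assumed stand-in, not part of the model's or MCP's actual API:

```ts
// Hypothetical sketch of the "on hold" workaround; all helpers are
// assumed stand-ins, not real APIs of the model or of MCP.
declare function playHoldMusic(): void;
declare function stopHoldMusic(): void;
declare function transcribe(): Promise<string>;          // e.g. a parallel parakeet ASR pass
declare function summarizeQuery(t: string): Promise<string>;
declare function queryMcp(q: string): Promise<string>;   // the actual MCP tool call
declare function restartWithContext(prompt: string): Promise<void>;

async function watchTextChannel(tokens: AsyncIterable<string>) {
  let buffer = "";
  for await (const tok of tokens) {
    buffer += tok;
    // Trigger phrase the model was prompted to emit when it needs data.
    if (buffer.includes("I am looking it up")) {
      buffer = "";
      playHoldMusic();
      const transcript = await transcribe();
      const query = await summarizeQuery(transcript);
      const results = await queryMcp(query);
      // Restart the model with the conversation summary plus the MCP
      // results folded into the text prompt.
      await restartWithContext(`Summary: ${transcript}\nResults: ${results}`);
      stopHoldMusic();
    }
  }
}
```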
We built a working implementation of the tool-calling architecture @royrajarshi described, and wanted to share what we learned — including what works and what doesn’t.
What Doesn’t Work: text_prompt for Content Delivery
The suggested approach of restarting the model with MCP results in text_prompt runs into a fundamental limitation: text_prompt is fine-tuned for persona shaping only, not content injection.
PersonaPlex’s text_prompt training data (~2,250 hours of synthetic dialogues) only included persona-shaping prompts ("You are a wise teacher", "You are an astronaut named Alex"). It never included patterns like "Tell the user these facts: [data]" or "Relay this information: [results]".
We tested extensively:
"IMPORTANT: Tell the user X"→ model greets normally, ignores content"You just looked something up and found: [results]"→ model greets normally"Your job right now is to share this: X"→ model greets normally- First-person framing (
"You just found: X") → model greets normally
The model adopts persona (name, role, style) from text_prompt perfectly, but will not relay specific facts. The greeting behavior ("Hello, this is [Name]!") is baked into instruction fine-tuning weights and overrides any content in the prompt.
What Works: Drip-Feed Token Injection + External TTS
We built a Talker-Reasoner architecture (based on DeepMind’s "Agents Thinking Fast and Slow") where PersonaPlex is System 1 and a Letta AI agent with persistent memory + web search is System 2.
Key discovery — drip-feed sendText() at Moshi’s frame rate:
Moshi’s Inner Monologue (per the Moshi paper) predicts one text token per audio frame at 12.5Hz (80ms intervals). Burst injection of 300+ characters at once overwhelms temporal alignment and causes repetition degeneration ("plus plus plus...").
Fix: send 20 characters every 80ms, matching the model’s per-frame consumption rate:
```ts
// Split the reply into chunks of at most 20 characters, then send one
// chunk per 80ms, matching Moshi's 12.5Hz per-frame consumption rate.
const chunks = response.match(/.{1,20}/g) ?? [response];
for (const chunk of chunks) {
  sendText(chunk);
  await new Promise(r => setTimeout(r, 80));
}
```
This transfers knowledge into the model's Inner Monologue without degeneration. However, PersonaPlex also tries to speak the injected text (which comes out garbled), so we:
- Gate PersonaPlex audio at the server — drop audio frames during delivery + 8s cooldown
- Play clean TTS audio instead — sentence-chunked external TTS (~500ms to first audio)
The drip-feed gives PersonaPlex the knowledge for follow-up questions, while the external TTS delivers the clean answer.
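For concreteness, the gate is just a timestamp check on the server's audio path. A minimal sketch follows; the identifiers are ours for illustration and are not names from the actual bridge code:

```ts
// Simplified sketch of the server-side audio gate; identifiers are
// illustrative, not taken from the actual bridge implementation.
const COOLDOWN_MS = 8_000;           // 8s cooldown after delivery ends
let gateUntil = 0;                   // timestamp until which frames are dropped

function onDeliveryStart() { gateUntil = Number.MAX_SAFE_INTEGER; }
function onDeliveryEnd()   { gateUntil = Date.now() + COOLDOWN_MS; }

function onModelAudioFrame(frame: Uint8Array, send: (f: Uint8Array) => void) {
  // Drop PersonaPlex's own (garbled) audio during delivery and the
  // cooldown window; the clean external TTS plays to the user instead.
  if (Date.now() < gateUntil) return;
  send(frame);
}
```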
For the MCP use case @harsh2ai asked about: the pattern is trigger detection → MCP tool call → drip-feed results into PersonaPlex + play TTS for the user. PersonaPlex has no idea MCP exists; it just receives knowledge through its Inner Monologue channel.
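Putting it together for the MCP case, the loop looks roughly like this; all helper names below are assumed for illustration rather than taken from the repo:

```ts
// End-to-end sketch: trigger detection → MCP tool call → drip-feed +
// external TTS. All helper names are assumed, not APIs from the repo.
declare function detectTrigger(token: string): string | null; // query, or null if no trigger
declare function callMcpTool(query: string): Promise<string>;
declare function speakTts(text: string): Promise<void>;       // clean audio to the user
declare function dripFeed(text: string): Promise<void>;       // 20 chars / 80ms, as above

async function onTextToken(token: string) {
  const query = detectTrigger(token);
  if (!query) return;
  const results = await callMcpTool(query);
  // PersonaPlex never sees MCP directly; the knowledge arrives through
  // its Inner Monologue while the TTS delivers the audible answer.
  await Promise.all([speakTts(results), dripFeed(results)]);
}
```

Running the TTS and the drip-feed concurrently keeps time-to-first-audio low while the model absorbs the same content for follow-up questions.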
Detailed technical writeup: gist
Code: vaos-voice-bridge