VibeVoice-Realtime-0.5B (ONNX FP16)
Microsoft's VibeVoice-Realtime-0.5B exported to ONNX format in FP16 precision with KV-cache support.
Zero PyTorch dependency at inference time. Only requires: onnxruntime, numpy, soundfile, tokenizers.
Architecture
Five ONNX models form the VibeVoice TTS pipeline:
| Model | Size | Description |
|---|---|---|
| `text_lm_kv.onnx` | 374 MB | 4-layer Qwen2 text encoder with KV-cache |
| `tts_lm_kv.onnx` | 572 MB | 20-layer Qwen2 TTS LM with KV-cache + EOS classifier |
| `diffusion_head.onnx` | 81 MB | Latent denoiser (5-step DPM-Solver++) |
| `vocoder.onnx` | 656 MB | Acoustic decoder (latents → 24kHz audio) |
| `acoustic_connector.onnx` | 1.7 MB | Speech feedback projection |
Plus speaker voice presets (.npz files) and the inference script.
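Before running inference, it can help to confirm that all five graphs are present locally. The following is a minimal sketch, assuming the files sit together in one directory under the names from the table; the `missing_models` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names as listed in the architecture table.
MODEL_FILES = (
    "text_lm_kv.onnx",
    "tts_lm_kv.onnx",
    "diffusion_head.onnx",
    "vocoder.onnx",
    "acoustic_connector.onnx",
)


def missing_models(model_dir):
    """Return the expected ONNX file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in MODEL_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical check: report which graphs still need downloading.
    absent = missing_models(".")
    print("missing:", absent if absent else "none")
```

An empty list means every graph the pipeline needs is in place; the voice presets and the inference script are not covered by this check.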
Usage
pip install onnxruntime numpy soundfile tokenizers
python vibevoice_full_onnx.py --text "Hello, this is a test." --speaker Carter
Options
python vibevoice_full_onnx.py \
--text "Your text here" \
--speaker Carter \
--output output.wav \
--cfg_scale 1.5
Available speakers: Carter, Frank, Emma, Grace, Davis, Mike, Wayne, plus multilingual presets (de-Spk0, fr-Spk1, sp-Spk0, etc.)
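To call the script from other Python code, one option is a thin `subprocess` wrapper. This is a sketch only: `build_command` and `synthesize` are hypothetical helpers, the flags are the ones shown above, and the default values mirror the example rather than the script's own defaults, which are not documented here.

```python
import subprocess
import sys


def build_command(text, speaker="Carter", output="output.wav", cfg_scale=1.5):
    """Assemble the argument list for vibevoice_full_onnx.py."""
    return [
        sys.executable, "vibevoice_full_onnx.py",
        "--text", text,
        "--speaker", speaker,
        "--output", output,
        "--cfg_scale", str(cfg_scale),
    ]


def synthesize(text, **options):
    """Run the script from the current directory; raises if it exits non-zero."""
    subprocess.run(build_command(text, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the text contains quotes or other special characters.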
Pipeline Flow
Text → [text_lm_kv] → hidden states
↓
[tts_lm_kv] ← [acoustic_connector] ← speech latent
↓
[diffusion_head] × 5 steps with CFG
↓
speech latent (64-dim)
↓
[vocoder] → audio (24kHz)
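The "with CFG" step refers to classifier-free guidance. A minimal NumPy sketch of the standard formulation follows; whether the inference script combines the two predictions in exactly this way is an assumption, and `apply_cfg` is a hypothetical name.

```python
import numpy as np


def apply_cfg(cond, uncond, cfg_scale):
    """Standard classifier-free guidance: move the unconditional prediction
    toward the conditional one, scaled by cfg_scale."""
    return uncond + cfg_scale * (cond - uncond)


# Dummy 64-dim arrays standing in for two diffusion_head predictions.
cond = np.ones(64, dtype=np.float32)
uncond = np.zeros(64, dtype=np.float32)
guided = apply_cfg(cond, uncond, cfg_scale=1.5)
print(guided.shape)
```

With `cfg_scale=1.0` the result equals the conditional prediction; larger values push further away from the unconditional one, which is what the `--cfg_scale` option controls.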
Export Details
- Exported from `microsoft/VibeVoice-Realtime-0.5B` in FP16
- Voice presets converted from BF16 → FP16
- DPM-Solver++ scheduler implemented in pure NumPy
- KV-cache passed as explicit inputs/outputs for stateful generation
- ONNX opset 18
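Passing the KV-cache as explicit inputs and outputs means the caller owns the state and feeds each call's cache outputs into the next call. The sketch below illustrates that pattern with a stand-in function instead of a real ONNX Runtime session; the tensor names, the single-tensor cache, and the `(batch, seq, head_dim)` layout are simplifying assumptions, not the exported graphs' actual signatures.

```python
import numpy as np


def fake_decoder_step(new_tokens, past_k, past_v, head_dim=8):
    """Stand-in for one run of a *_kv.onnx graph (hypothetical).

    Takes the cache as input and returns the extended cache as output.
    """
    steps = new_tokens.shape[1]
    k_new = np.zeros((1, steps, head_dim), dtype=np.float16)
    v_new = np.zeros((1, steps, head_dim), dtype=np.float16)
    present_k = np.concatenate([past_k, k_new], axis=1)
    present_v = np.concatenate([past_v, v_new], axis=1)
    return present_k, present_v


# Start with an empty cache (sequence length 0).
past_k = np.zeros((1, 0, 8), dtype=np.float16)
past_v = np.zeros((1, 0, 8), dtype=np.float16)

for chunk_len in (5, 1, 1):  # a 5-token prompt, then two single-token steps
    tokens = np.zeros((1, chunk_len), dtype=np.int64)
    # Outputs of this call become the cache inputs of the next call.
    past_k, past_v = fake_decoder_step(tokens, past_k, past_v)

print(past_k.shape)
```

Because only the new tokens are processed on each call, the per-step cost stays roughly constant while the cache grows along the sequence axis.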
License
MIT (same as original VibeVoice model)