# Voice OpenCLAP PoC
This is a proof-of-concept (PoC) release. The model was trained on a single dataset for a few epochs as a rapid feasibility test. It is not intended for production use and will be superseded by a properly trained version.
## Model

- Audio encoder: Whisper-small encoder (~88M params, initialized from OpenAI pretrained weights)
- Text encoder: ModernBERT-base (~149M params, initialized from the pretrained `answerdotai/ModernBERT-base` weights)
- Embedding dimension: 512
- Loss: Symmetric InfoNCE (CLIP-style contrastive loss)
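The symmetric InfoNCE objective is a cross-entropy loss over in-batch negatives, averaged across both retrieval directions. A minimal sketch of the idea (function name and details are illustrative, not taken from this repo):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(audio_embs, text_embs, logit_scale):
    # audio_embs, text_embs: (B, D), assumed L2-normalized
    logits = logit_scale * audio_embs @ text_embs.T  # (B, B) pairwise similarities
    # matched audio/text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)    # audio -> text direction
    loss_t2a = F.cross_entropy(logits.T, targets)  # text -> audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```

Each sample's positive is its own caption; every other caption in the batch serves as a negative, which is why larger effective batch sizes generally help contrastive training.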
## Training
- Dataset: laion/majestrino-data — ~8.2M audio-caption pairs (raw FLAC audio, 32kHz mono, with natural language captions)
- Effective batch size: 1024 (128 per GPU x 8 GPUs)
- Learning rate: 1e-4 cosine schedule, 2000-step warmup
- Weight decay: 0.05
- Precision: amp_bf16
- Logit scale: Clamped at ln(20)
- Samples per epoch: 10M (resampled)
- Best checkpoint: Epoch 3 (this release)
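The logit scale above is the CLIP-style learnable temperature, stored in log space; clamping it at ln(20) keeps the effective similarity multiplier at or below 20. A sketch of how such a clamp is typically applied (names and the 1/0.07 init are CLIP conventions, assumed rather than confirmed for this repo):

```python
import math
import torch

# learnable log-temperature, CLIP-style init (exp ~= 14.3)
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def clamped_scale(logit_scale, max_log=math.log(20.0)):
    # clamp in log space each step so exp(logit_scale) never exceeds 20
    return logit_scale.clamp(max=max_log).exp()
```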
## Evaluation Results

Evaluated on a held-out validation set of 21,148 samples drawn from the same dataset distribution.
| Metric | Value |
|---|---|
| Val Loss | 2.798 |
| Audio→Text R@1 | 4.07% |
| Audio→Text R@5 | 10.86% |
| Audio→Text R@10 | 15.19% |
| Text→Audio R@1 | 3.42% |
| Text→Audio R@5 | 9.77% |
| Text→Audio R@10 | 14.07% |
### Full epoch-by-epoch results
| Epoch | Train Loss | Val Loss | A→T R@1 | A→T R@5 | A→T R@10 | T→A R@1 | T→A R@5 | T→A R@10 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.49 | 2.932 | 3.37% | 9.03% | 12.91% | 2.95% | 8.50% | 12.28% |
| 2 | 1.29 | 2.884 | 3.56% | 9.47% | 13.32% | 3.05% | 9.01% | 13.26% |
| 3 | 1.00 | 2.798 | 4.07% | 10.86% | 15.19% | 3.42% | 9.77% | 14.07% |
| 4 | 0.95 | 2.925 | 3.63% | 9.98% | 14.03% | 3.39% | 9.41% | 14.02% |
| 5 | 0.55 | 2.954 | 3.54% | 9.86% | 13.51% | 3.27% | 9.38% | 13.47% |
| 6 | ~0.4 | 3.187 | 3.14% | 8.42% | 12.16% | 2.90% | 8.59% | 12.75% |
| 7 | ~0.3 | 3.552 | 2.87% | 8.08% | 11.79% | 2.76% | 8.06% | 12.05% |
| 8 | 0.23 | 3.649 | 2.76% | 7.78% | 11.10% | 2.61% | 7.71% | 11.44% |
Overfitting begins after epoch 3 — train loss continues decreasing but val loss and retrieval metrics degrade.
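Recall@k here means the fraction of queries whose matched item lands in the top-k most similar candidates. A minimal sketch of how A→T R@k can be computed from a similarity matrix (illustrative; not the repo's eval code):

```python
import torch

def recall_at_k(sims, k):
    # sims: (N, N) audio-to-text similarities; ground-truth matches are the diagonal
    topk = sims.topk(k, dim=-1).indices                 # (N, k) top candidates per query
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # (N, 1) correct index per query
    hits = (topk == targets).any(dim=-1)                # did the match land in the top k?
    return hits.float().mean().item()
```

Note that with 21,148 candidates, random-chance R@1 is about 0.005%, so even the modest 4.07% above is well above chance.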
## Usage

```python
import torch
from inference import VoiceOpenCLAP, load_model, preprocess_audio

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer = load_model("model.pt", device=device)

# Encode text
texts = ["a man speaking calmly", "loud music playing"]
tok = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=77)
with torch.no_grad():
    text_embs = model.encode_text(tok.input_ids.to(device), tok.attention_mask.to(device))

# Encode audio
wav = preprocess_audio("audio.wav").to(device)
with torch.no_grad():
    audio_emb = model.encode_audio(wav)

# Similarity
sims = audio_emb @ text_embs.T
print(sims)
```
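If you want a probability over the candidate captions rather than raw similarities, a softmax over the scaled similarities works; the factor of 20 below mirrors the clamped logit scale (exp(ln 20) = 20) and is an assumption, not necessarily the repo's exact inference path:

```python
import torch

# stand-in for the sims row produced above (audio_emb @ text_embs.T)
sims = torch.tensor([[0.31, 0.12]])
probs = (sims * 20.0).softmax(dim=-1)  # scale by the (assumed) logit scale before softmax
best = probs.argmax(dim=-1)            # index of the best-matching caption
```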
## Requirements

```shell
pip install torch torchaudio transformers openai-whisper
```
## Files

- `model.pt` — Model weights (state dict only, ~0.96 GB)
- `inference.py` — Self-contained inference script (no external CLAP dependencies needed)
- `model_config.json` — Model architecture configuration
## Disclaimer
This is an early proof-of-concept. Known limitations:
- Trained on a single dataset (majestrino) with limited diversity
- Begins overfitting after epoch 3 (~30M samples seen)
- Retrieval metrics are modest (4% R@1) — expected to improve significantly with more data, multi-dataset training, and longer schedules
- The mel spectrogram computation requires the `openai-whisper` package for exact filter bank alignment (a fallback using torchaudio is included but may produce slightly different results)
## License
Apache 2.0