Voice OpenCLAP PoC

This is a proof-of-concept (PoC) release. The model was trained on a single dataset for a few epochs as a rapid feasibility test. It is not intended for production use and will be superseded by a properly trained version.

Model

  • Audio encoder: Whisper-small encoder (~88M params, initialized from OpenAI pretrained weights)
  • Text encoder: ModernBERT-base (~149M params, initialized from answerdotai/ModernBERT-base)
  • Embedding dimension: 512
  • Loss: Symmetric InfoNCE (CLIP-style contrastive loss)
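The symmetric InfoNCE objective listed above can be sketched as follows. This is a minimal NumPy illustration, not the repository's training code; the function name and the use of ln(20) as the default logit scale are assumptions based on the configuration below.

```python
import numpy as np

def symmetric_info_nce(audio_embs, text_embs, logit_scale=np.log(20.0)):
    # L2-normalize both embedding sets so the dot product is cosine similarity
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scaled similarity logits; matched audio-text pairs lie on the diagonal
    logits = np.exp(logit_scale) * (a @ t.T)
    n = logits.shape[0]
    idx = np.arange(n)

    def ce(l):
        # Cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the audio→text and text→audio directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With perfectly matched pairs the loss approaches zero; mismatched pairs drive it up, which is what the contrastive objective optimizes.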

Training

  • Dataset: laion/majestrino-data — ~8.2M audio-caption pairs (raw FLAC audio, 32kHz mono, with natural language captions)
  • Effective batch size: 1024 (128 per GPU x 8 GPUs)
  • Learning rate: 1e-4 cosine schedule, 2000-step warmup
  • Weight decay: 0.05
  • Precision: amp_bf16
  • Logit scale: Clamped at ln(20)
  • Samples per epoch: 10M (resampled)
  • Best checkpoint: Epoch 3 (this release)
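The learning-rate schedule above (2000-step linear warmup into cosine decay from 1e-4) can be sketched as a pure function of the step index. Note that `total_steps` here is a placeholder assumption; the actual step count of the run is not stated.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=2000, total_steps=80000):
    # Linear warmup from 0 to base_lr over the first warmup_steps steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```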

Evaluation Results

Evaluated on a held-out validation set (~21,148 samples) drawn from the same dataset distribution.

Metric             Value
Val Loss           2.798
Audio→Text R@1     4.07%
Audio→Text R@5     10.86%
Audio→Text R@10    15.19%
Text→Audio R@1     3.42%
Text→Audio R@5     9.77%
Text→Audio R@10    14.07%
Full epoch-by-epoch results:

Epoch  Train Loss  Val Loss  A→T R@1  A→T R@5  A→T R@10  T→A R@1  T→A R@5  T→A R@10
1      1.49        2.932     3.37%    9.03%    12.91%    2.95%    8.50%    12.28%
2      1.29        2.884     3.56%    9.47%    13.32%    3.05%    9.01%    13.26%
3      1.00        2.798     4.07%    10.86%   15.19%    3.42%    9.77%    14.07%
4      0.95        2.925     3.63%    9.98%    14.03%    3.39%    9.41%    14.02%
5      0.55        2.954     3.54%    9.86%    13.51%    3.27%    9.38%    13.47%
6      ~0.4        3.187     3.14%    8.42%    12.16%    2.90%    8.59%    12.75%
7      ~0.3        3.552     2.87%    8.08%    11.79%    2.76%    8.06%    12.05%
8      0.23        3.649     2.76%    7.78%    11.10%    2.61%    7.71%    11.44%

Overfitting begins after epoch 3: the train loss continues to decrease while the validation loss and retrieval metrics degrade.
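For reference, the Recall@K metrics in the tables above can be computed from a pairwise similarity matrix as sketched below. This is a minimal NumPy illustration, not the project's evaluation code; it assumes pair (i, i) is the ground-truth match.

```python
import numpy as np

def recall_at_k(sims, k):
    # sims[i, j]: similarity between query i and candidate j; (i, i) is the match
    n = sims.shape[0]
    # Rank candidates for each query by descending similarity
    topk = np.argsort(-sims, axis=1)[:, :k]
    # A hit means the matching index appears in the query's top-k list
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()
```

Audio→Text R@K passes the audio-by-text similarity matrix; Text→Audio R@K passes its transpose.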

Usage

import torch
from inference import VoiceOpenCLAP, load_model, preprocess_audio

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer = load_model("model.pt", device=device)

# Encode text
texts = ["a man speaking calmly", "loud music playing"]
tok = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=77)
with torch.no_grad():
    text_embs = model.encode_text(tok.input_ids.to(device), tok.attention_mask.to(device))

# Encode audio
wav = preprocess_audio("audio.wav").to(device)
with torch.no_grad():
    audio_emb = model.encode_audio(wav)

# Similarity
sims = audio_emb @ text_embs.T
print(sims)
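To turn the raw similarity scores from the snippet above into a ranked caption list, one option is the sketch below. It assumes the encoders return L2-normalized embeddings (worth verifying against inference.py); shown in NumPy for brevity.

```python
import numpy as np

def rank_captions(audio_emb, text_embs, texts):
    # Cosine similarities, assuming both inputs are already L2-normalized:
    # audio_emb has shape (D,), text_embs has shape (N, D)
    sims = text_embs @ audio_emb
    # Sort captions from most to least similar
    order = np.argsort(-sims)
    return [(texts[i], float(sims[i])) for i in order]
```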

Requirements

pip install torch torchaudio transformers openai-whisper

Files

  • model.pt — Model weights (state dict only, ~0.96 GB)
  • inference.py — Self-contained inference script (no external CLAP dependencies needed)
  • model_config.json — Model architecture configuration

Disclaimer

This is an early proof-of-concept. Known limitations:

  • Trained on a single dataset (laion/majestrino-data) with limited diversity
  • Begins overfitting after epoch 3 (~30M samples seen)
  • Retrieval metrics are modest (4% R@1) — expected to improve significantly with more data, multi-dataset training, and longer schedules
  • The mel spectrogram computation requires the openai-whisper package for exact filter bank alignment (a fallback using torchaudio is included but may produce slightly different results)

License

Apache 2.0
