Voice OpenCLAP PoC

This is a proof-of-concept (PoC) release. The model was trained on a single dataset for a few epochs as a rapid feasibility test. It is not intended for production use and will be superseded by a properly trained version.

Model

  • Audio encoder: Whisper-small encoder (~88M params, initialized from OpenAI pretrained weights)
  • Text encoder: ModernBERT-base (~149M params, initialized from answerdotai/ModernBERT-base)
  • Embedding dimension: 512
  • Loss: Symmetric InfoNCE (CLIP-style contrastive loss)
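The symmetric InfoNCE objective listed above can be sketched as follows. This is a minimal NumPy illustration, not the repository's training code; the function name and the use of ln(20) as the default logit scale are assumptions based on the configuration below.

```python
import numpy as np

def symmetric_info_nce(audio_embs, text_embs, logit_scale=np.log(20.0)):
    # L2-normalize both embedding sets so the dot product is cosine similarity
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Scaled similarity logits; matched audio-text pairs lie on the diagonal
    logits = np.exp(logit_scale) * (a @ t.T)
    n = logits.shape[0]
    idx = np.arange(n)

    def ce(l):
        # Cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # Average the audio→text and text→audio directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With perfectly matched pairs the loss approaches zero; mismatched pairs drive it up, which is what the contrastive objective optimizes.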

Training

  • Dataset: laion/majestrino-data — ~8.2M audio-caption pairs (raw FLAC audio, 32kHz mono, with natural language captions)
  • Effective batch size: 1024 (128 per GPU x 8 GPUs)
  • Learning rate: 1e-4 cosine schedule, 2000-step warmup
  • Weight decay: 0.05
  • Precision: amp_bf16
  • Logit scale: Clamped at ln(20)
  • Samples per epoch: 10M (resampled)
  • Best checkpoint: Epoch 3 (this release)
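The learning-rate schedule above (2000-step linear warmup into cosine decay from 1e-4) can be sketched as a pure function of the step index. Note that `total_steps` here is a placeholder assumption; the actual step count of the run is not stated.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=2000, total_steps=80000):
    # Linear warmup from 0 to base_lr over the first warmup_steps steps
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```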

Evaluation Results

Evaluated on a held-out validation set (~21,148 samples) drawn from the same dataset distribution.

Metric             Value
Val Loss           2.798
Audio→Text R@1     4.07%
Audio→Text R@5     10.86%
Audio→Text R@10    15.19%
Text→Audio R@1     3.42%
Text→Audio R@5     9.77%
Text→Audio R@10    14.07%
Full epoch-by-epoch results:

Epoch  Train Loss  Val Loss  A→T R@1  A→T R@5  A→T R@10  T→A R@1  T→A R@5  T→A R@10
1      1.49        2.932     3.37%    9.03%    12.91%    2.95%    8.50%    12.28%
2      1.29        2.884     3.56%    9.47%    13.32%    3.05%    9.01%    13.26%
3      1.00        2.798     4.07%    10.86%   15.19%    3.42%    9.77%    14.07%
4      0.95        2.925     3.63%    9.98%    14.03%    3.39%    9.41%    14.02%
5      0.55        2.954     3.54%    9.86%    13.51%    3.27%    9.38%    13.47%
6      ~0.4        3.187     3.14%    8.42%    12.16%    2.90%    8.59%    12.75%
7      ~0.3        3.552     2.87%    8.08%    11.79%    2.76%    8.06%    12.05%
8      0.23        3.649     2.76%    7.78%    11.10%    2.61%    7.71%    11.44%

Overfitting begins after epoch 3: the train loss continues to decrease while the validation loss and retrieval metrics degrade.
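For reference, the Recall@K metrics in the tables above can be computed from a pairwise similarity matrix as sketched below. This is a minimal NumPy illustration, not the project's evaluation code; it assumes pair (i, i) is the ground-truth match.

```python
import numpy as np

def recall_at_k(sims, k):
    # sims[i, j]: similarity between query i and candidate j; (i, i) is the match
    n = sims.shape[0]
    # Rank candidates for each query by descending similarity
    topk = np.argsort(-sims, axis=1)[:, :k]
    # A hit means the matching index appears in the query's top-k list
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()
```

Audio→Text R@K passes the audio-by-text similarity matrix; Text→Audio R@K passes its transpose.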

Usage

import torch
from inference import VoiceOpenCLAP, load_model, preprocess_audio

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer = load_model("model.pt", device=device)

# Encode text
texts = ["a man speaking calmly", "loud music playing"]
tok = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=77)
with torch.no_grad():
    text_embs = model.encode_text(tok.input_ids.to(device), tok.attention_mask.to(device))

# Encode audio
wav = preprocess_audio("audio.wav").to(device)
with torch.no_grad():
    audio_emb = model.encode_audio(wav)

# Similarity
sims = audio_emb @ text_embs.T
print(sims)
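To turn the raw similarity scores from the snippet above into a ranked caption list, one option is the sketch below. It assumes the encoders return L2-normalized embeddings (worth verifying against inference.py); shown in NumPy for brevity.

```python
import numpy as np

def rank_captions(audio_emb, text_embs, texts):
    # Cosine similarities, assuming both inputs are already L2-normalized:
    # audio_emb has shape (D,), text_embs has shape (N, D)
    sims = text_embs @ audio_emb
    # Sort captions from most to least similar
    order = np.argsort(-sims)
    return [(texts[i], float(sims[i])) for i in order]
```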

Requirements

pip install torch torchaudio transformers openai-whisper

Files

  • model.pt — Model weights (state dict only, ~0.96 GB)
  • inference.py — Self-contained inference script (no external CLAP dependencies needed)
  • model_config.json — Model architecture configuration

Disclaimer

This is an early proof-of-concept. Known limitations:

  • Trained on a single dataset (laion/majestrino-data) with limited diversity
  • Begins overfitting after epoch 3 (~30M samples seen)
  • Retrieval metrics are modest (4% R@1) — expected to improve significantly with more data, multi-dataset training, and longer schedules
  • The mel spectrogram computation requires the openai-whisper package for exact filter bank alignment (a fallback using torchaudio is included but may produce slightly different results)

License

Apache 2.0
