TL;DR ARK-ASR-0.6B is an automatic speech recognition model trained with teacher-data adaptation and on-policy distillation, using a compact 0.6B-scale decoder LLM together with a dedicated audio encoder and adapter. The accompanying training, inference, and evaluation code is available at AutoArk/open-audio-opd.
Abstract
ARK-ASR is an ASR student model optimized with the teacher-data adaptation + on-policy distillation (TD + OPD) recipe from open-audio-opd.
Instead of relying only on static supervised transcripts, OPD lets the student generate transcripts online and trains it against token-level teacher scores on the student's own generated behavior. This checkpoint corresponds to the Ark-Base+TD+OPD model reported in the open-audio-opd results.
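As a rough illustration of training against token-level teacher scores on the student's own rollout, the sketch below computes a sampled reverse-KL style gap. It assumes the teacher scores are log-probabilities of the tokens the student generated; the function name `sampled_reverse_kl` and this particular objective are illustrative assumptions, not the open-audio-opd implementation.

```python
import math


def sampled_reverse_kl(student_logprobs, teacher_logprobs):
    """Mean of log p_student(y_t) - log p_teacher(y_t) over student-sampled tokens y_t."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("need exactly one teacher score per student token")
    if not student_logprobs:
        return 0.0
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)


# Hypothetical per-token probabilities for a three-token student rollout.
student = [math.log(0.9), math.log(0.6), math.log(0.8)]
teacher = [math.log(0.9), math.log(0.2), math.log(0.7)]

# A positive value means the student is more confident than the teacher
# on its own tokens; minimizing it pulls the student toward the teacher.
print(sampled_reverse_kl(student, teacher))
```

The gap is zero for tokens where the student and teacher agree, so only the positions where the teacher disagrees with the student's own choices contribute to the signal.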
ARK-ASR currently supports Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian ASR.
Supported Languages
Chinese, English, German, Japanese, French, Korean, Spanish, Polish, Italian, Romanian, Hungarian, Czech, Dutch, Finnish, Croatian, Slovak, Slovene, Estonian, and Lithuanian.
Model Overview
Figure 1: ARK-ASR architecture. Audio is encoded by a Whisper-style encoder with RoPE, merged through an MLP adapter, and injected into a Qwen2 decoder by replacing audio placeholder token embeddings before transcript generation.
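The placeholder replacement described in the caption can be sketched in plain Python. This is a toy illustration with list-based embeddings; the placeholder id `99` and the helper name `inject_audio_embeddings` are hypothetical, and the real model performs this step on tensors inside its remote code.

```python
def inject_audio_embeddings(token_ids, text_embeddings, audio_embeddings, audio_token_id):
    """Return a copy of text_embeddings with audio placeholder rows replaced in order."""
    positions = [i for i, tok in enumerate(token_ids) if tok == audio_token_id]
    if len(positions) != len(audio_embeddings):
        raise ValueError(
            f"{len(positions)} placeholders but {len(audio_embeddings)} audio frames"
        )
    merged = [list(row) for row in text_embeddings]  # copy, leave the input untouched
    for pos, audio_row in zip(positions, audio_embeddings):
        merged[pos] = list(audio_row)
    return merged


AUDIO_ID = 99  # hypothetical placeholder token id
token_ids = [1, AUDIO_ID, AUDIO_ID, 2]
text_emb = [[0.1, 0.1], [0.0, 0.0], [0.0, 0.0], [0.2, 0.2]]
audio_emb = [[0.5, 0.6], [0.7, 0.8]]  # adapter outputs, one row per placeholder

merged = inject_audio_embeddings(token_ids, text_emb, audio_emb, AUDIO_ID)
print(merged)  # [[0.1, 0.1], [0.5, 0.6], [0.7, 0.8], [0.2, 0.2]]
```

The length check matters in practice: the prompt must contain exactly as many placeholder tokens as the adapter emits audio frames, otherwise the sequences cannot be aligned.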
- Model size: 0.6B decoder LLM parameters, with a separate 0.6B-scale Whisper-style audio encoder and MLP adapter
- Task: automatic speech recognition
- Architecture: audio-capable autoregressive Transformers model with custom arkasr remote code
- Checkpoint format: safetensors; an INT8 ONNX package is also provided for edge-device ASR inference
- Sampling rate: 16 kHz
- Recommended inference code: scripts/infer/ark_asr_transformers.py
- vLLM serving: scripts/vllm/ark_asr_vllm
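Since the model expects 16 kHz audio, it can help to inspect a file before transcription. The following is a minimal standard-library sketch for PCM WAV files; the helper names are hypothetical, and whether the official processor resamples automatically is not stated here, so treat this only as a pre-flight check.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # 16 kHz, as stated in the model overview


def wav_info(path):
    """Read sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "rate": rate,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's sample rate differs from the expected rate."""
    return wav_info(path)["rate"] != expected_rate


# Demo on a generated one-second silent clip (a throwaway test file).
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit PCM
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)

print(wav_info(demo_path), needs_resampling(demo_path))
```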
The model should be loaded with trust_remote_code=True. The official inference script handles the processor, tokenizer, audio prompt format, generation cleanup, and ASR token filtering.
INT8 ONNX for Edge Inference
This repository also provides an INT8 ONNX package for local edge-device ASR inference: ark_asr_int8_onnx/
The package is intended for deployment scenarios where a compact ONNX Runtime pipeline is preferred over loading the full Transformers checkpoint. It includes:
- INT8 ONNX files for the decoder, audio encoder, and audio adapter
- FP32 token embedding ONNX assets
- tokenizer, processor, and runtime configuration files
- a self-contained ASR inference script: infer_ark_audio_onnx.py
Install the runtime dependencies:
```bash
pip install onnxruntime torch transformers librosa soundfile numpy
```
Run transcription from the ONNX package directory:
```bash
cd ark_asr_int8_onnx
python infer_ark_audio_onnx.py \
  --audio /path/to/audio.wav \
  --max-new-tokens 128
```
The ONNX inference script applies ASR decoding filters for non-text control tokens by default.
The full usage guide is available in ark_asr_int8_onnx/README_INT8_ASR_USAGE.md.
Performance
The following results are from the open-audio-opd evaluation. Lower CER/WER is better.
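For readers unfamiliar with the metrics, WER and CER are edit distances normalized by reference length, computed over words and characters respectively. The sketch below is a minimal reference implementation without any text normalization; the official evaluation scripts may normalize casing, punctuation, or spacing differently, so numbers from this sketch are not guaranteed to match the tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (a common choice for Chinese)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One substitution plus one deletion against six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
# One substituted character out of four.
print(cer("你好世界", "你好世间"))
```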
English WER
| Model | AMI | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | VoxPopuli | Avg |
|---|---|---|---|---|---|---|---|---|
| Ark-ASR | 11.54% | 10.07% | 8.95% | 1.87% | 3.89% | 2.89% | 6.63% | 6.55% |
| Qwen3-ASR-0.6B | 11.66% | 11.06% | 9.14% | 2.13% | 4.45% | 3.03% | 7.07% | 6.93% |
| Qwen3-ASR-1.7B | 10.56% | 10.25% | 8.74% | 1.63% | 3.40% | 2.84% | 6.35% | 6.25% |
Chinese CER
| Model | AISHELL-1 | Wenet-meeting | Wenet-net | Avg |
|---|---|---|---|---|
| Ark-ASR | 2.02% | 5.92% | 4.96% | 4.30% |
| Qwen3-ASR-0.6B | 2.07% | 5.57% | 5.45% | 4.36% |
| Qwen3-ASR-1.7B | 1.50% | 4.69% | 4.55% | 3.58% |
Ark-ASR is the 0.6B-scale ASR checkpoint trained with teacher-data adaptation and on-policy distillation from open-audio-opd.
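The Avg columns above are consistent with an unweighted mean over the listed datasets. A short check, using only the numbers from the two tables:

```python
english_wer = {
    "Ark-ASR": [11.54, 10.07, 8.95, 1.87, 3.89, 2.89, 6.63],
    "Qwen3-ASR-0.6B": [11.66, 11.06, 9.14, 2.13, 4.45, 3.03, 7.07],
    "Qwen3-ASR-1.7B": [10.56, 10.25, 8.74, 1.63, 3.40, 2.84, 6.35],
}
chinese_cer = {
    "Ark-ASR": [2.02, 5.92, 4.96],
    "Qwen3-ASR-0.6B": [2.07, 5.57, 5.45],
    "Qwen3-ASR-1.7B": [1.50, 4.69, 4.55],
}


def macro_average(scores):
    """Unweighted mean across datasets, rounded to two decimals like the tables."""
    return round(sum(scores) / len(scores), 2)


for name, scores in english_wer.items():
    print(f"{name}: English WER avg {macro_average(scores):.2f}%")
for name, scores in chinese_cer.items():
    print(f"{name}: Chinese CER avg {macro_average(scores):.2f}%")
```

Because the mean is unweighted, each benchmark counts equally regardless of how many utterances it contains.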
Inference
Run ASR inference with Hugging Face Transformers:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_path = "AutoArk-AI/ARK-ASR-0.6B"
audio_path = "assets/libai.wav"

# Use half precision on GPU and full precision on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch_dtype,
    attn_implementation="sdpa",
).to(device)
model.eval()


def build_bad_words_ids(tokenizer):
    # Ban every special token and every <...>-style added token except EOS,
    # so generation cannot emit non-text control tokens.
    eos_ids = tokenizer.eos_token_id
    keep_ids = {eos_ids} if isinstance(eos_ids, int) else set(eos_ids or [])
    bad_ids = set(tokenizer.all_special_ids) - keep_ids
    bad_ids.update(
        token_id
        for token, token_id in tokenizer.get_added_vocab().items()
        if token.startswith("<") and token.endswith(">") and token_id not in keep_ids
    )
    return [[token_id] for token_id in sorted(bad_ids)]


# A single user turn with one audio clip and a transcription instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": audio_path},
            {"type": "text", "text": "Please transcribe this audio."},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
    text_kwargs={"padding": "longest"},
    audio_max_length=30 * 16000,  # 30 seconds at 16 kHz
)
inputs = inputs.to(device)
# Match the audio features to the model dtype.
if "audios" in inputs:
    inputs["audios"] = inputs["audios"].to(dtype=torch_dtype)

bad_words_ids = build_bad_words_ids(tokenizer)

# Greedy decoding with control tokens masked out.
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=256,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        bad_words_ids=bad_words_ids,
    )

# Drop the prompt tokens and decode only the newly generated transcript.
decoded_outputs = tokenizer.batch_decode(
    outputs[:, inputs.input_ids.shape[1] :],
    skip_special_tokens=True,
)
print(decoded_outputs)
```
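To see what the `build_bad_words_ids` filter in the example above selects, it can be run against a stub tokenizer without downloading the model. `StubTokenizer` and its token ids are hypothetical; only the three attributes the function reads are implemented.

```python
class StubTokenizer:
    """Hypothetical stand-in exposing only what build_bad_words_ids reads."""

    eos_token_id = 2
    all_special_ids = [0, 1, 2]

    def get_added_vocab(self):
        return {"<|im_end|>": 2, "<|audio|>": 10, "hello": 11}


def build_bad_words_ids(tokenizer):
    # Same logic as the inference example: ban special and <...>-style
    # added tokens, but keep EOS so generation can still stop.
    eos_ids = tokenizer.eos_token_id
    keep_ids = {eos_ids} if isinstance(eos_ids, int) else set(eos_ids or [])
    bad_ids = set(tokenizer.all_special_ids) - keep_ids
    bad_ids.update(
        token_id
        for token, token_id in tokenizer.get_added_vocab().items()
        if token.startswith("<") and token.endswith(">") and token_id not in keep_ids
    )
    return [[token_id] for token_id in sorted(bad_ids)]


print(build_bad_words_ids(StubTokenizer()))  # [[0], [1], [10]]
```

The EOS id (2) is kept, the plain added word "hello" (11) is left alone, and the remaining special and angle-bracket tokens are returned as single-token sequences, which is the shape `generate(bad_words_ids=...)` expects.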
For batch JSONL inference, use the open-source inference code:
```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e .
```
The input JSONL should contain one ASR sample per line:
```json
{"audio":"/path/to/audio.wav","text":"","task":"asr","begin_time":-1,"end_time":-1}
```
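A manifest in this shape can be generated from a list of audio paths. The sketch below copies the field values from the example record; the helper name `write_asr_manifest` is hypothetical, and the meaning of `begin_time` and `end_time` set to -1 is not specified here, so they are reproduced as-is.

```python
import json
import os
import tempfile
from pathlib import Path


def write_asr_manifest(audio_paths, manifest_path):
    """Write one JSON record per audio file, mirroring the example record."""
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    with manifest_path.open("w", encoding="utf-8") as f:
        for audio_path in audio_paths:
            record = {
                "audio": str(audio_path),
                "text": "",
                "task": "asr",
                "begin_time": -1,
                "end_time": -1,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest_path


# Demo with placeholder paths, written to a temporary directory.
demo_manifest = write_asr_manifest(
    ["/path/to/a.wav", "/path/to/b.wav"],
    os.path.join(tempfile.mkdtemp(), "input.jsonl"),
)
print(demo_manifest.read_text(encoding="utf-8"))
```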
```bash
python scripts/infer/ark_asr_transformers.py \
  --input /path/to/input.jsonl \
  --output runs/infer/predictions.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```
The output JSONL preserves input metadata and adds:
- pred_text: cleaned prediction text for downstream evaluation
- pred_text_raw: raw decoded generation before cleanup
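Reading the predictions back is a matter of parsing each line as JSON. A minimal sketch, assuming the `audio` key from the input record is among the preserved metadata; the sample records are invented for illustration.

```python
import json


def load_predictions(lines):
    """Parse JSONL lines and return (audio, pred_text) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        pairs.append((record.get("audio"), record["pred_text"]))
    return pairs


# Hypothetical output records; real files come from the inference script.
sample_lines = [
    '{"audio": "a.wav", "task": "asr", "pred_text": "hello world", "pred_text_raw": "hello world<|im_end|>"}',
    "",
    '{"audio": "b.wav", "task": "asr", "pred_text": "good morning", "pred_text_raw": "good morning"}',
]
for audio, text in load_predictions(sample_lines):
    print(audio, "->", text)
```

To process a real file, pass an open file handle, for example `load_predictions(open("runs/infer/predictions.jsonl", encoding="utf-8"))`.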
vLLM Online Serving
ARK-ASR can also be deployed as a vLLM-backed online ASR service with the adapter in scripts/vllm/ark_asr_vllm. The service exposes both a compact /asr endpoint and an OpenAI-style /v1/audio/transcriptions endpoint.
Clone and install the serving code:
```bash
git clone https://github.com/AutoArk/open-audio-opd
cd open-audio-opd
pip install -e ".[vllm]"
```
Start the service:
```bash
MODEL=AutoArk-AI/ARK-ASR-0.6B \
GPU=0 \
PORT=8025 \
scripts/vllm/deploy_ark_asr_vllm_service.sh start
```
Check the service:
```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh status
curl -sS http://127.0.0.1:8025/health
curl -sS http://127.0.0.1:8025/token-mask
```
Run one transcription request:
```bash
curl -sS -X POST http://127.0.0.1:8025/asr \
  -F file=@/path/to/audio.wav \
  -F max_new_tokens=256
```
OpenAI-style transcription endpoint:
```bash
curl -sS -X POST http://127.0.0.1:8025/v1/audio/transcriptions \
  -F file=@/path/to/audio.wav \
  -F model=ark-asr
```
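The same /asr request can be issued from Python. The sketch below builds the multipart form by hand with the standard library so it has no extra dependencies; the helper names are hypothetical, and because the response schema is not documented here, `transcribe` returns the raw response body instead of assuming particular JSON fields.

```python
import os
import urllib.request
import uuid


def build_asr_request(url, audio_bytes, filename, fields):
    """Build a multipart/form-data POST carrying form fields plus a `file` part."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=b"".join(parts), headers=headers, method="POST")


def transcribe(url, audio_path, max_new_tokens=256, timeout=60):
    """Send one audio file to the service and return the raw response text."""
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    request = build_asr_request(
        url, audio_bytes, os.path.basename(audio_path), {"max_new_tokens": max_new_tokens}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Requires a running service, for example:
# print(transcribe("http://127.0.0.1:8025/asr", "/path/to/audio.wav"))
```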
Stop the service:
```bash
scripts/vllm/deploy_ark_asr_vllm_service.sh stop
```
The vLLM adapter registers the custom arkasr model, loads the local
processor/tokenizer with trust_remote_code=True, applies generation-time
token masking for non-ASR control tokens, and keeps <|im_end|> as the stop
token. Service logs and PID files are written under runs/vllm/.
Evaluation
The repository also includes a J/WER evaluation entrypoint:
```bash
python scripts/eval/eval_jwer_ark_asr_transformers.py \
  --input /path/to/test.jsonl \
  --output runs/eval/result.jsonl \
  --model_path AutoArk-AI/ARK-ASR-0.6B \
  --processor_path AutoArk-AI/ARK-ASR-0.6B \
  --batch_size 40 \
  --dtype float16 \
  --attn_impl sdpa
```
No evaluation audio or dataset files are bundled with this model repository.
Acknowledgements
The training code is based on THUNLP/OPD and verl. The OPD recipe uses a stronger ASR teacher to score online student rollouts.
Citation
If you find ARK-ASR or open-audio-opd useful, please cite:
```bibtex
@misc{lin2026dataefficientopd,
  title={Data-Efficient On-Policy Distillation for Automatic Speech Recognition},
  author={Lin, Yu and Wang, Yiming and Cai, Runyuan and Zeng, Xiaodong},
  year={2026},
  eprint={2605.28139},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.28139}
}
```
Evaluation results
Results on hf-audio/open-asr-leaderboard:
- Mean WER: 6.03
- RTFx: 132.55
- AMI WER: 11.54
- Earnings22 WER: 10.07
- GigaSpeech WER: 8.95
- LibriSpeech Clean WER: 1.87
- LibriSpeech Other WER: 3.89
- SPGISpeech WER: 2.89