Pruned Stateless Zipformer RNN-T Streaming Robust ES v0
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on a combination of Spanish speech datasets.
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. ["w", "ɑ", "ʃ", "i", "ɑ"]. The model's vocabulary therefore consists of the IPA phonemes found in gruut.
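For illustration, target phoneme sequences of this form can be produced with gruut itself. A minimal sketch, assuming gruut's Spanish support is installed (e.g. pip install gruut[es]); the exact preprocessing used for training may differ:

from gruut import sentences

# Phonemize Spanish text into IPA phoneme lists.
# The sample text is taken from the decoding example further below.
for sentence in sentences("el gobierno", lang="es"):
    for word in sentence:
        if word.phonemes:
            print(word.text, word.phonemes)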
This model was trained using the icefall framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.
Setup
To set up all the necessary packages, please follow the installation instructions from the official icefall documentation.
When cloning the icefall repo, make sure to clone our fork via git clone https://github.com/bookbot-hive/icefall instead of the original repository.
Download Pre-trained Model
Once you've installed all the necessary packages, follow the steps below:
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
Evaluation Results
Chunk-wise Streaming
for m in greedy_search fast_beam_search modified_beam_search; do
./zipformer/streaming_decode.py \
--epoch 80 \
--avg 5 \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \
--use-transducer True \
--decoding-method $m \
--num-decode-streams 1000
done
The model achieves the following phoneme error rates on the different test sets:
| Decoding | Common Voice 23.0 ES | SLR72 |
|---|---|---|
| Fast Beam Search | 5.57% | 2.18% |
| Greedy Search | 2.85% | 1.56% |
| Modified Beam Search | 2.71% | 1.47% |
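The phoneme error rate is computed like a word error rate, but over phoneme tokens: the Levenshtein edit distance between the hypothesis and reference phoneme sequences, normalized by the reference length. A minimal, self-contained sketch of the metric (not the icefall scoring script; the example sequences are made up):

def phoneme_error_rate(ref, hyp):
    """Edit distance between phoneme sequences, normalized by reference length."""
    # prev[j] holds the edit distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution out of ten reference phonemes -> 10% PER.
ref = ["e", "l", "ɡ", "o", "b", "ʝ", "e", "ɾ", "n", "o"]
hyp = ["e", "l", "ɡ", "o", "β", "ʝ", "e", "ɾ", "n", "o"]
print(f"PER: {phoneme_error_rate(ref, hyp):.2%}")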
Usage
Inference
To decode with greedy search, run:
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
Decoding Output
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
Training procedure
Install icefall
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
Prepare Data
cd egs/bookbot_es/ASR
./prepare.sh
Train
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
--world-size 2 \
--num-epochs 80 \
--exp-dir tmp/exp-causal \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--max-duration 1000 \
--base-lr 0.04 \
--use-transducer True \
--use-fp16 1
Exporting to ONNX
To export the trained model to ONNX, run:
./zipformer/export-onnx-streaming.py \
--tokens data/lang_phone/tokens.txt \
--avg 5 \
--causal 1 \
--exp-dir tmp/zipformer-streaming-robust-es-v0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--use-transducer True \
--epoch 80
The exported ONNX files are stored in the directory given by --exp-dir.
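Before converting to ORT, you can sanity-check the export by loading each file with onnxruntime and inspecting its input/output signature. A minimal sketch; the .onnx filenames here are an assumption based on the ORT filenames listed in the next section, so adjust them to match your actual export:

import onnxruntime as ort

exp_dir = "tmp/zipformer-streaming-robust-es-v0"
for name in ("encoder", "decoder", "joiner"):
    # Filenames assumed to follow the epoch/avg/chunk naming used below.
    sess = ort.InferenceSession(
        f"{exp_dir}/{name}-epoch-80-avg-5-chunk-16-left-128.onnx",
        providers=["CPUExecutionProvider"],
    )
    print(name)
    for inp in sess.get_inputs():
        print("  input :", inp.name, inp.shape)
    for out in sess.get_outputs():
        print("  output:", out.name, out.shape)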
Converting ONNX to ORT
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
Running the command above converts the ONNX files to the ORT format and additionally produces int8-quantized versions. The following files will be generated:
Standard ORT files:
encoder-epoch-80-avg-5-chunk-16-left-128.ort
decoder-epoch-80-avg-5-chunk-16-left-128.ort
joiner-epoch-80-avg-5-chunk-16-left-128.ort
INT8 Quantized ORT files:
encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort
decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort
joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort
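ORT-format files are loaded through the same InferenceSession API as regular .onnx files, so the quantized models can be checked the same way. A minimal sketch, run from the directory above:

import onnxruntime as ort

# ORT-format (and int8-quantized) models load exactly like .onnx models.
sess = ort.InferenceSession(
    "encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort",
    providers=["CPUExecutionProvider"],
)
print("inputs :", [i.name for i in sess.get_inputs()])
print("outputs:", [o.name for o in sess.get_outputs()])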