Pruned Stateless Zipformer RNN-T Streaming Robust ES v0
Pruned Stateless Zipformer RNN-T Streaming Robust ES v0 is a Spanish automatic speech recognition model trained on a combination of Spanish speech datasets.
Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. ["w", "ɑ", "ʃ", "i", "ɑ"]. The model's vocabulary therefore consists of the IPA phonemes found in gruut.
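For illustration, target phoneme sequences of this form can be produced with gruut itself. A minimal sketch, assuming gruut's Spanish support is installed (e.g. pip install gruut[es]); the exact preprocessing used for training may differ:

from gruut import sentences

# Phonemize Spanish text into IPA phoneme lists.
# The sample text is taken from the decoding example further below.
for sentence in sentences("el gobierno", lang="es"):
    for word in sentence:
        if word.phonemes:
            print(word.text, word.phonemes)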
This model was trained using the icefall framework. All training was done on 2 NVIDIA RTX 4090 GPUs. All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.
Setup
To set up all the necessary packages, please follow the installation instructions from the official icefall documentation.
When cloning the icefall repo, make sure to clone our fork via git clone https://github.com/bookbot-hive/icefall instead of the original repository.
Download Pre-trained Model
Once you've installed all the necessary packages, follow the steps below:
cd egs/bookbot_es/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/zipformer-streaming-robust-es-v0/
cd ..
Evaluation Results
Chunk-wise Streaming
for m in greedy_search fast_beam_search modified_beam_search; do
./zipformer/streaming_decode.py \
--epoch 80 \
--avg 5 \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--exp-dir tmp/zipformer-streaming-robust-es-v0/ \
--use-transducer True \
--decoding-method $m \
--num-decode-streams 1000
done
The model achieves the following phoneme error rates on the different test sets:
| Decoding | Common Voice 23.0 ES | SLR72 |
|---|---|---|
| Fast Beam Search | 5.57% | 2.18% |
| Greedy Search | 2.85% | 1.56% |
| Modified Beam Search | 2.71% | 1.47% |
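The phoneme error rate is computed like a word error rate, but over phoneme tokens: the Levenshtein edit distance between the hypothesis and reference phoneme sequences, normalized by the reference length. A minimal, self-contained sketch of the metric (not the icefall scoring script; the example sequences are made up):

def phoneme_error_rate(ref, hyp):
    """Edit distance between phoneme sequences, normalized by reference length."""
    # prev[j] holds the edit distance between the first i-1 reference
    # phonemes and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            )
        prev = curr
    return prev[-1] / len(ref)

# One substitution out of ten reference phonemes -> 10% PER.
ref = ["e", "l", "ɡ", "o", "b", "ʝ", "e", "ɾ", "n", "o"]
hyp = ["e", "l", "ɡ", "o", "β", "ʝ", "e", "ɾ", "n", "o"]
print(f"PER: {phoneme_error_rate(ref, hyp):.2%}")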
Usage
Inference
To decode with greedy search, run:
./tmp/zipformer-streaming-robust-es-v0/jit_pretrained_streaming.py \
--nn-model-filename ./tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt \
--tokens ./tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt \
./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
Decoding Output
2025-11-18 01:52:34,422 INFO [jit_pretrained_streaming.py:175] {'nn_model_filename': './tmp/zipformer-streaming-robust-es-v0/jit_script_chunk_16_left_128.pt', 'tokens': './tmp/zipformer-streaming-robust-es-v0/data/lang_phone/tokens.txt', 'sample_rate': 16000, 'sound_file': './tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav'}
2025-11-18 01:52:34,426 INFO [jit_pretrained_streaming.py:181] device: cuda:0
2025-11-18 01:52:35,082 INFO [jit_pretrained_streaming.py:194] Constructing Fbank computer
2025-11-18 01:52:35,083 INFO [jit_pretrained_streaming.py:197] Reading sound files: ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:202] torch.Size([114688])
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:204] Decoding started
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:209] chunk_length: 32
2025-11-18 01:52:35,090 INFO [jit_pretrained_streaming.py:210] T: 45
2025-11-18 01:52:35,105 INFO [jit_pretrained_streaming.py:226] 0/119488
2025-11-18 01:52:35,117 INFO [jit_pretrained_streaming.py:226] 4000/119488
2025-11-18 01:52:35,453 INFO [jit_pretrained_streaming.py:226] 8000/119488
2025-11-18 01:52:35,454 INFO [jit_pretrained_streaming.py:226] 12000/119488
2025-11-18 01:52:35,475 INFO [jit_pretrained_streaming.py:226] 16000/119488
2025-11-18 01:52:35,503 INFO [jit_pretrained_streaming.py:226] 20000/119488
2025-11-18 01:52:35,536 INFO [jit_pretrained_streaming.py:226] 24000/119488
2025-11-18 01:52:35,548 INFO [jit_pretrained_streaming.py:226] 28000/119488
2025-11-18 01:52:35,549 INFO [jit_pretrained_streaming.py:226] 32000/119488
2025-11-18 01:52:35,561 INFO [jit_pretrained_streaming.py:226] 36000/119488
2025-11-18 01:52:35,588 INFO [jit_pretrained_streaming.py:226] 40000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 44000/119488
2025-11-18 01:52:35,612 INFO [jit_pretrained_streaming.py:226] 48000/119488
2025-11-18 01:52:35,644 INFO [jit_pretrained_streaming.py:226] 52000/119488
2025-11-18 01:52:35,682 INFO [jit_pretrained_streaming.py:226] 56000/119488
2025-11-18 01:52:35,694 INFO [jit_pretrained_streaming.py:226] 60000/119488
2025-11-18 01:52:35,714 INFO [jit_pretrained_streaming.py:226] 64000/119488
2025-11-18 01:52:35,717 INFO [jit_pretrained_streaming.py:226] 68000/119488
2025-11-18 01:52:35,734 INFO [jit_pretrained_streaming.py:226] 72000/119488
2025-11-18 01:52:35,748 INFO [jit_pretrained_streaming.py:226] 76000/119488
2025-11-18 01:52:35,765 INFO [jit_pretrained_streaming.py:226] 80000/119488
2025-11-18 01:52:35,767 INFO [jit_pretrained_streaming.py:226] 84000/119488
2025-11-18 01:52:35,780 INFO [jit_pretrained_streaming.py:226] 88000/119488
2025-11-18 01:52:35,794 INFO [jit_pretrained_streaming.py:226] 92000/119488
2025-11-18 01:52:35,808 INFO [jit_pretrained_streaming.py:226] 96000/119488
2025-11-18 01:52:35,822 INFO [jit_pretrained_streaming.py:226] 100000/119488
2025-11-18 01:52:35,823 INFO [jit_pretrained_streaming.py:226] 104000/119488
2025-11-18 01:52:35,837 INFO [jit_pretrained_streaming.py:226] 108000/119488
2025-11-18 01:52:35,850 INFO [jit_pretrained_streaming.py:226] 112000/119488
2025-11-18 01:52:35,864 INFO [jit_pretrained_streaming.py:226] 116000/119488
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:256] ./tmp/zipformer-streaming-robust-es-v0/test_waves/sample_1.wav
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:257] elgobʝeɾnopwestoadisposiθʝondelapoblaθʝonlosmedʝosneθesaɾʝospaɾalareubikaθʝondelasbiktimas
2025-11-18 01:52:35,866 INFO [jit_pretrained_streaming.py:259] Decoding Done
Training procedure
Install icefall
git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH=`pwd`:$PYTHONPATH
Prepare Data
cd egs/bookbot_es/ASR
./prepare.sh
Train
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
--world-size 2 \
--num-epochs 80 \
--exp-dir tmp/exp-causal \
--causal 1 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--max-duration 1000 \
--base-lr 0.04 \
--use-transducer True \
--use-fp16 1
Exporting to ONNX
To export the trained model to ONNX, run:
./zipformer/export-onnx-streaming.py \
--tokens data/lang_phone/tokens.txt \
--avg 5 \
--causal 1 \
--exp-dir tmp/zipformer-streaming-robust-es-v0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--chunk-size 16 \
--left-context-frames 128 \
--use-transducer True \
--epoch 80
The exported ONNX files are stored in the directory given by --exp-dir.
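Before converting to ORT, you can sanity-check the export by loading each file with onnxruntime and inspecting its input/output signature. A minimal sketch; the .onnx filenames here are an assumption based on the ORT filenames listed in the next section, so adjust them to match your actual export:

import onnxruntime as ort

exp_dir = "tmp/zipformer-streaming-robust-es-v0"
for name in ("encoder", "decoder", "joiner"):
    # Filenames assumed to follow the epoch/avg/chunk naming used below.
    sess = ort.InferenceSession(
        f"{exp_dir}/{name}-epoch-80-avg-5-chunk-16-left-128.onnx",
        providers=["CPUExecutionProvider"],
    )
    print(name)
    for inp in sess.get_inputs():
        print("  input :", inp.name, inp.shape)
    for out in sess.get_outputs():
        print("  output:", out.name, out.shape)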
Converting ONNX to ORT
cd tmp/zipformer-streaming-robust-es-v0
python -m onnxruntime.tools.convert_onnx_models_to_ort --optimization_style=Fixed .
Running the command above converts the ONNX files to the ORT format and additionally produces int8-quantized versions. The following files will be generated:
Standard ORT files:
encoder-epoch-80-avg-5-chunk-16-left-128.ort
decoder-epoch-80-avg-5-chunk-16-left-128.ort
joiner-epoch-80-avg-5-chunk-16-left-128.ort
INT8 Quantized ORT files:
encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort
decoder-epoch-80-avg-5-chunk-16-left-128.int8.ort
joiner-epoch-80-avg-5-chunk-16-left-128.int8.ort
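ORT-format files are loaded through the same InferenceSession API as regular .onnx files, so the quantized models can be checked the same way. A minimal sketch, run from the directory above:

import onnxruntime as ort

# ORT-format (and int8-quantized) models load exactly like .onnx models.
sess = ort.InferenceSession(
    "encoder-epoch-80-avg-5-chunk-16-left-128.int8.ort",
    providers=["CPUExecutionProvider"],
)
print("inputs :", [i.name for i in sess.get_inputs()])
print("outputs:", [o.name for o in sess.get_outputs()])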