Poor performance

by chatboo - opened Nov 28, 2025

Nov 28, 2025

I tried this model as a replacement for Parakeet TDT 0.6B and MarbleNet frame VAD. It seems too poor to use in any real application: I'm getting high WER and a very high rate of false positives on end-of-utterance (EOU) detection. Can anyone confirm similar experiences? Cheers

chatboo

Nov 28, 2025

Problem: The ASR was processing audio in independent 1-second chunks and discarding context after each transcription, so the model never saw complete sentences. It only saw fragments like “Who is the” and “Australia” separately, which caused poor transcription accuracy and multiple false end-of-utterance detections.

Fix: The implementation was changed to accumulate audio continuously in a growing buffer and only transcribe when natural silence of about 800 milliseconds or more is detected after speech. This allows the model to see the complete utterance “Who is the Prime Minister of Australia?” as one piece.

Result: The model now has full sentence context for accurate word recognition, and end-of-utterance triggers only once when the speaker actually stops talking, instead of after every one-second chunk.
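
For anyone hitting the same issue, here is a rough sketch of the buffering approach described above. It is not the actual code from my pipeline: `UtteranceBuffer`, the `transcribe` callback, the 16 kHz sample rate, and the simple RMS energy threshold are all assumptions for illustration. Only the roughly 800 ms silence window comes from the fix itself.

```python
from typing import Callable, List, Optional, Sequence

SAMPLE_RATE = 16_000      # assumed sample rate in Hz
SILENCE_MS = 800          # trailing silence that ends an utterance
ENERGY_THRESHOLD = 0.01   # assumed RMS cutoff; a stand-in for a real VAD


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


class UtteranceBuffer:
    """Accumulates audio and transcribes once per utterance (hypothetical helper)."""

    def __init__(
        self,
        transcribe: Callable[[List[float]], str],  # hypothetical ASR callback
        sample_rate: int = SAMPLE_RATE,
        silence_ms: int = SILENCE_MS,
        threshold: float = ENERGY_THRESHOLD,
    ) -> None:
        self.transcribe = transcribe
        self.silence_samples_needed = sample_rate * silence_ms // 1000
        self.threshold = threshold
        self.buffer: List[float] = []
        self.heard_speech = False
        self.trailing_silence = 0

    def push(self, frame: Sequence[float]) -> Optional[str]:
        """Feed one frame; return a transcript when an utterance ends, else None."""
        if rms(frame) >= self.threshold:
            # Speech resets the silence counter, so short pauses do not split the utterance.
            self.heard_speech = True
            self.trailing_silence = 0
        elif self.heard_speech:
            self.trailing_silence += len(frame)
        else:
            return None  # leading silence: nothing worth buffering yet

        self.buffer.extend(frame)

        if self.trailing_silence >= self.silence_samples_needed:
            # Enough silence after speech: transcribe the whole utterance once.
            text = self.transcribe(self.buffer)
            self.buffer = []
            self.heard_speech = False
            self.trailing_silence = 0
            return text
        return None
```

Call `push()` with each incoming frame (for example 100 ms of samples) and treat any non-None return value as the single EOU event for that utterance. In a real setup you would probably swap the RMS check for a proper VAD, since a fixed energy threshold is fragile in noisy audio.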

chatboo changed discussion status to closed Nov 28, 2025
