Poor performance

#6
by chatboo - opened

I tried this model as a replacement for Parakeet TDT 0.6B and MarbleNet frame VAD. It seems too poor to use in any real application: I'm getting high WER and a very high rate of false positives on end-of-utterance (EOU) detection. Can anyone confirm similar experiences? Cheers

Problem: The ASR was processing audio in independent 1-second chunks and discarding context after each transcription, so the model never saw complete sentences. It only saw fragments like “Who is the” and “Australia” separately, which caused poor transcription accuracy and multiple false end-of-utterance detections.

Fix: The implementation was changed to accumulate audio continuously in a growing buffer and to transcribe only once a pause of roughly 800 milliseconds or longer follows speech. This lets the model see the complete utterance “Who is the Prime Minister of Australia?” as one piece.

Result: The model now has full sentence context for accurate word recognition, and end-of-utterance triggers only once when the speaker actually stops talking, instead of after every one-second chunk.
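
For anyone hitting the same issue, here is a minimal sketch of the buffering approach described in the fix. It is not the actual code from my pipeline. It assumes a simple RMS energy threshold stands in for a real VAD, and `transcribe` is a hypothetical callback wrapping whatever ASR call you use. The class name, the `energy_threshold` value, and the other parameter names are illustrative assumptions, not part of the model's API.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


@dataclass
class UtteranceBuffer:
    # Hypothetical callback: takes all buffered samples, returns text.
    transcribe: Callable[[List[float]], str]
    sample_rate: int = 16000
    silence_ms: int = 800            # pause length that ends an utterance
    energy_threshold: float = 0.01   # assumed value; tune for your input
    _audio: List[float] = field(default_factory=list)
    _heard_speech: bool = False
    _silent_samples: int = 0

    def push(self, frame: Sequence[float]) -> Optional[str]:
        """Feed one frame. Returns a transcript when the utterance ends, else None."""
        if rms(frame) >= self.energy_threshold:
            self._heard_speech = True
            self._silent_samples = 0   # any speech resets the pause counter
        elif not self._heard_speech:
            return None                # drop silence before speech starts
        else:
            self._silent_samples += len(frame)

        # Keep growing the buffer instead of transcribing each chunk alone.
        self._audio.extend(frame)

        needed = self.sample_rate * self.silence_ms // 1000
        if self._silent_samples >= needed:
            text = self.transcribe(self._audio)
            # Reset state for the next utterance.
            self._audio = []
            self._heard_speech = False
            self._silent_samples = 0
            return text
        return None
```

Call `push` with each incoming frame and handle a non-`None` return value as the end-of-utterance event. Short pauses inside a sentence only reset when speech resumes, so they stay in the buffer and the ASR is invoked once per utterance rather than once per chunk.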

chatboo changed discussion status to closed
