Pearmut: Human Evaluation of Translation Made Trivial
Abstract
Human evaluation is the gold standard for multilingual NLP, but it is often skipped in practice and substituted with automatic metrics, because with existing tools it is notoriously complex and slow to set up, incurring substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is also extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active-learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
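The three protocols named in the abstract differ mainly in what annotators record per segment: DA (Direct Assessment) collects a single scalar quality score, ESA (Error Span Annotation) collects marked error spans plus a final score, and MQM (Multidimensional Quality Metrics) collects typed error spans with severities. The sketch below illustrates these differences as plain data structures; the class and field names are hypothetical and do not reflect Pearmut's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ErrorSpan:
    """A marked error span within a translated segment (hypothetical schema)."""
    start: int                      # character offset where the span begins
    end: int                        # character offset where the span ends (exclusive)
    severity: str                   # e.g. "minor" or "major"
    category: Optional[str] = None  # MQM only: error type such as "accuracy/mistranslation"


@dataclass
class SegmentAnnotation:
    """One annotator's judgment of a single translated segment (hypothetical schema)."""
    segment_id: str
    protocol: str                   # "DA", "ESA", or "MQM"
    score: Optional[float] = None   # DA, ESA: scalar quality score (0-100)
    spans: list[ErrorSpan] = field(default_factory=list)  # ESA, MQM: annotated error spans


# DA: a single scalar judgment, no spans.
da = SegmentAnnotation(segment_id="doc1:3", protocol="DA", score=78.0)

# ESA: error spans with severities, plus a final scalar score.
esa = SegmentAnnotation(
    segment_id="doc1:3",
    protocol="ESA",
    score=62.0,
    spans=[ErrorSpan(start=14, end=29, severity="major")],
)

# MQM: typed error spans; an overall score is typically derived from severity weights.
mqm = SegmentAnnotation(
    segment_id="doc1:3",
    protocol="MQM",
    spans=[ErrorSpan(start=14, end=29, severity="major",
                     category="accuracy/mistranslation")],
)
```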
Community
Happy to discuss how people human-evaluate multilingual tasks! 🙂
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations (2025)
- JP-TL-Bench: Anchored Pairwise LLM Evaluation for Bidirectional Japanese-English Translation (2026)
- InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages (2025)
- When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content (2025)
- Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies (2025)
- Can QE-informed (Re)Translation lead to Error Correction? (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend