Dataset: IWSLT/ted_talks_iwslt
How to use dhintech/marian-tedtalks-id-en with Transformers:
# Use a pipeline as a high-level helper
# Warning: Pipeline type "translation" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("translation", model="dhintech/marian-tedtalks-id-en")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("dhintech/marian-tedtalks-id-en")
model = AutoModelForSeq2SeqLM.from_pretrained("dhintech/marian-tedtalks-id-en")

This model is a fine-tuned version of Helsinki-NLP/opus-mt-id-en, optimized for real-time meeting translation from Indonesian to English.
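The version warning in the snippet above can be turned into a small runtime check. The following is a minimal sketch, assuming the v5 cutoff stated in that warning is accurate; the helper names (`major_version`, `translation_pipeline_available`, `installed_transformers_version`) are hypothetical and not part of Transformers.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '4.44.2'."""
    return int(version.split(".")[0])


def translation_pipeline_available(version: str) -> bool:
    """Assumption taken from the warning above: the "translation"
    pipeline type is only usable before transformers v5."""
    return major_version(version) < 5


def installed_transformers_version() -> str:
    """Read the installed transformers version without importing the package.

    Raises importlib.metadata.PackageNotFoundError if it is not installed.
    """
    return metadata.version("transformers")
```

A caller could use `translation_pipeline_available(installed_transformers_version())` to choose between the pipeline helper and loading the model directly.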
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "dhintech/marian-tedtalks-id-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate Indonesian to English
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=96)
    outputs = model.generate(
        **inputs,
        max_length=96,
        num_beams=3,  # Optimized for speed
        early_stopping=True,
        do_sample=False
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Selamat pagi, mari kita mulai rapat hari ini."
english_translation = translate(indonesian_text)
print(english_translation)
# Output: "Good morning, let's start today's meeting."
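Because the example above tokenizes with `truncation=True` and `max_length=96`, longer passages are cut off rather than translated in full. One way around this is to split the text into sentences first. The sketch below is illustrative only: `split_sentences` and `translate_long` are hypothetical helpers, the regex is a naive splitter (it will also split after abbreviations such as "Dr."), and `translate_fn` stands for any single-sentence translator, for example the `translate` function above.

```python
import re
from typing import Callable, List

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Naively split text into sentences; returns [] for empty input."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]


def translate_long(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence separately and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))
```

With the model loaded, `translate_long(long_text, translate)` would call the model once per sentence, so latency grows with the number of sentences.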
import time
from transformers import MarianMTModel, MarianTokenizer
import torch

class OptimizedMeetingTranslator:
    def __init__(self, model_name="dhintech/marian-tedtalks-id-en"):
        self.tokenizer = MarianTokenizer.from_pretrained(model_name)
        self.model = MarianMTModel.from_pretrained(model_name)

        # Optimize for inference
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda()

    def translate(self, text, max_length=96):
        start_time = time.time()

        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length
        )

        # Move the input tensors to the GPU when the model lives there
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                num_beams=3,
                early_stopping=True,
                do_sample=False,
                pad_token_id=self.tokenizer.pad_token_id
            )

        translation = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        translation_time = time.time() - start_time

        return {
            'translation': translation,
            'time': translation_time,
            'input_length': len(text.split()),
            'output_length': len(translation.split())
        }

# Usage example
translator = OptimizedMeetingTranslator()
result = translator.translate("Apakah ada pertanyaan mengenai proposal ini?")
print(f"Translation: {result['translation']}")
print(f"Time: {result['time']:.3f}s")
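A single timing, as printed above, can be misleading: the first call often includes one-time warm-up costs, and individual calls vary. The following sketch summarizes a list of the `'time'` values returned by `translate`. The helper name `summarize_latencies`, the `skip_first` option, and the nearest-rank p95 are assumptions made for illustration, not part of the model card.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_latencies(times: Sequence[float], skip_first: bool = True) -> Dict[str, float]:
    """Summarize per-call latencies in seconds.

    When skip_first is True and more than one sample exists, the first
    sample is dropped as a presumed warm-up call.
    """
    samples = list(times[1:]) if skip_first and len(times) > 1 else list(times)
    if not samples:
        raise ValueError("need at least one timing sample")
    ordered = sorted(samples)
    # Nearest-rank 95th percentile
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }
```

For example, collecting `translator.translate(s)['time']` over a list of sentences and passing the values to `summarize_latencies` gives a steadier picture than one measurement.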
def batch_translate(sentences, translator):
    # Translates sentences one at a time and aggregates the timings;
    # the inputs are not combined into a single generate() call.
    results = []
    total_time = 0.0

    for sentence in sentences:
        result = translator.translate(sentence)
        results.append(result)
        total_time += result['time']

    count = len(sentences)
    return {
        'results': results,
        'total_time': total_time,
        # Guard against division by zero for empty input
        'average_time': total_time / count if count else 0.0,
        'sentences_per_second': count / total_time if total_time > 0 else 0.0
    }

# Example batch translation
meeting_sentences = [
    "Selamat pagi, mari kita mulai rapat hari ini.",
    "Apakah ada pertanyaan mengenai proposal ini?",
    "Tim marketing akan bertanggung jawab untuk strategi ini.",
    "Mari kita diskusikan timeline implementasi project ini."
]

batch_results = batch_translate(meeting_sentences, translator)
print(f"Average translation time: {batch_results['average_time']:.3f}s")
print(f"Throughput: {batch_results['sentences_per_second']:.1f} sentences/second")
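The `batch_translate` helper above loops over sentences and calls the model once per sentence. An alternative is to pad several sentences into one tensor and call `generate` once per batch. The sketch below shows that approach under stated assumptions: `chunked` and `translate_batched` are hypothetical names, `batch_size=8` is an arbitrary default, and the generation settings simply mirror those used above. Whether this is faster depends on hardware and sentence lengths, so it is worth measuring.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_batched(sentences, tokenizer, model, batch_size=8, max_length=96):
    """Translate sentences in padded batches, one generate() call per batch.

    `tokenizer` and `model` are expected to be the Marian objects loaded
    earlier; they are passed in so nothing is downloaded by this sketch.
    """
    translations = []
    for batch in chunked(sentences, batch_size):
        # padding=True pads every sentence to the longest one in the batch
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(model.device)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=3,
            early_stopping=True,
            do_sample=False,
        )
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With the objects from the class above, a call might look like `translate_batched(meeting_sentences, translator.tokenizer, translator.model)`.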
Sample translations in business-meeting contexts:

| Indonesian | English | Context |
|---|---|---|
| Selamat pagi, mari kita mulai rapat hari ini. | Good morning, let's start today's meeting. | Meeting Opening |
| Apakah ada pertanyaan mengenai proposal ini? | Are there any questions about this proposal? | Q&A Session |
| Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
| Mari kita diskusikan timeline implementasi project ini. | Let's discuss the implementation timeline for this project. | Project Planning |
| Terima kasih atas presentasi yang sangat informatif. | Thank you for the very informative presentation. | Appreciation |

Sample translations in technical contexts:

| Indonesian | English | Context |
|---|---|---|
| Teknologi AI berkembang sangat pesat di Indonesia. | AI technology is developing very rapidly in Indonesia. | Tech Discussion |
| Mari kita analisis data performa bulan lalu. | Let's analyze last month's performance data. | Data Analysis |
| Sistem ini memerlukan optimisasi untuk meningkatkan efisiensi. | This system needs optimization to improve efficiency. | Technical Review |
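
The example pairs in the tables above can double as a quick regression check after loading the model. This is a minimal sketch: `REFERENCE_PAIRS` copies three rows from the tables, while `normalize` and `exact_match_report` are hypothetical helpers. Exact match is a strict criterion, and actual model output may differ from the listed translations in wording or punctuation, so mismatches are a prompt to inspect rather than proof of a fault.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Source/reference pairs copied from the example tables above.
REFERENCE_PAIRS: List[Tuple[str, str]] = [
    ("Selamat pagi, mari kita mulai rapat hari ini.",
     "Good morning, let's start today's meeting."),
    ("Apakah ada pertanyaan mengenai proposal ini?",
     "Are there any questions about this proposal?"),
    ("Mari kita analisis data performa bulan lalu.",
     "Let's analyze last month's performance data."),
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_report(
    translate_fn: Callable[[str], str],
    pairs: Sequence[Tuple[str, str]] = REFERENCE_PAIRS,
) -> List[Dict[str, str]]:
    """Return one entry per pair whose translation differs from the reference."""
    mismatches = []
    for source, expected in pairs:
        actual = translate_fn(source)
        if normalize(actual) != normalize(expected):
            mismatches.append({"source": source, "expected": expected, "actual": actual})
    return mismatches
```

Passing the single-sentence `translate` function from earlier, `exact_match_report(translate)` returns an empty list when every output matches its reference.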
@misc{marian-id-en-optimized-2025,
  title={MarianMT Indonesian-English Translation (Optimized for Real-Time Meetings)},
  author={DhinTech},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/dhintech/marian-tedtalks-id-en}},
  note={Fine-tuned on TED Talks corpus with meeting-specific optimizations}
}
We welcome contributions to improve this model.
This model is specifically optimized for Indonesian business meeting translation scenarios. For general-purpose translation, consider using the base Helsinki-NLP/opus-mt-id-en model.
Base model: Helsinki-NLP/opus-mt-id-en