Description
I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Hugging Face). For sentences longer than about 8 words, only the first part of the sentence is translated and the rest is dropped. For example:
English sentences:
Terry's side fell to their second Premier League loss of the season at Loftus Road
Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.
Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.
Translations:
طرفدار تری در فوتبال دوم فصل در لئوپوس رود به
پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ
مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه
According to Google Translate, these translate back to:
More fans in the second football season in Leopard
After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh
Mark Wells has been a writer and broadcaster for over a decade
I can't find any configuration setting that would limit the number of tokens being translated.
Here is my code:
#!/usr/bin/python3
import sys
#from transformers import MarianTokenizer, MarianMTModel
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

device = "cuda:0"

# Model directory is built from the first command-line argument.
base_dir = sys.argv[1] + "persiannlp"
size = "base"
mname = f'{base_dir}/data/mt5-{size}-parsinlu-translation_en_fa'

tokenizer = MT5Tokenizer.from_pretrained(mname)
model = MT5ForConditionalGeneration.from_pretrained(mname)
model = model.to(device)

# Protocol: sentences arrive one per line on stdin. 'EOD' ends a batch,
# which is translated and followed by 'EOL'. 'EOF' ends the program.
lines = []
for line in sys.stdin:
    line = line.strip()
    if line == 'EOD':
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        translated = model.generate(**inputs)
        for t in translated:
            print(tokenizer.decode(t, skip_special_tokens=True))
        print('EOL')
        sys.stdout.flush()
        lines.clear()
    elif line.startswith('EOF'):
        sys.exit(0)
    else:
        lines.append(line)
sys.exit(0)
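
For anyone trying to reproduce this, here is a minimal sketch of the same EOD/EOF loop, pulled into functions so it can be run without a GPU or the model. The names `batch_lines` and `make_translator` are made up for this example and are not part of transformers. One thing I have not ruled out: as far as I understand, `generate()` falls back to a short default output length when no length argument is given, so the sketch passes `max_length` explicitly. The value 256 is an arbitrary assumption, not a recommended setting.

```python
from typing import Callable, Iterable, Iterator, List


def batch_lines(stream: Iterable[str],
                translate: Callable[[List[str]], List[str]]) -> Iterator[str]:
    """Yield output lines for the EOD/EOF protocol used in the script above.

    Input lines are collected until 'EOD', then translated as one batch and
    followed by 'EOL'. A line starting with 'EOF' stops processing; any lines
    still pending at that point are discarded, as in the original script.
    """
    pending: List[str] = []
    for raw in stream:
        line = raw.strip()
        if line == "EOD":
            if pending:  # skip the model call for an empty batch
                yield from translate(pending)
            yield "EOL"
            pending = []
        elif line.startswith("EOF"):
            return
        else:
            pending.append(line)


def make_translator(model, tokenizer, device: str,
                    max_length: int = 256) -> Callable[[List[str]], List[str]]:
    """Wrap a seq2seq model and tokenizer into a batch translate function.

    max_length is passed to generate() explicitly instead of relying on the
    default; 256 is an assumed value for illustration only.
    """
    def translate(lines: List[str]) -> List[str]:
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        outputs = model.generate(**inputs, max_length=max_length)
        return [tokenizer.decode(t, skip_special_tokens=True) for t in outputs]

    return translate
```

With the real model this would be driven by something like `for out in batch_lines(sys.stdin, make_translator(model, tokenizer, device)): print(out, flush=True)`. I have not yet confirmed whether setting the length explicitly changes the truncated output shown above.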