Machine translation - long sentences cause incomplete translation #32

@gregorybrooks

Description

I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Hugging Face). For sentences longer than about 8 words, only the first part is translated and the rest of the sentence is ignored. For example:

English sentences:

Terry's side fell to their second Premier League loss of the season at Loftus Road

Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.

Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.

Translations:

طرفدار تری در فوتبال دوم فصل در لئوپوس رود به

پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ

مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه

According to Google Translate, these translate back to:

More fans in the second football season in Leopard

After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh

Mark Wells has been a writer and broadcaster for over a decade

I can't find any configuration settings that would be limiting the number of tokens being translated.
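One way to look for such settings is to read the JSON files saved next to the model weights. The sketch below is only an illustration: `find_length_settings` and `LENGTH_KEYS` are hypothetical names, and it assumes the model directory holds a `config.json` (and, in newer transformers versions, possibly a `generation_config.json`). It reports whichever length-related keys those files set.

```python
import json
from pathlib import Path

# Assumption: these are the length-related keys worth checking.
LENGTH_KEYS = ("max_length", "min_length", "max_new_tokens")


def find_length_settings(model_dir):
    """Hypothetical helper: return length-related keys found in the model's JSON configs."""
    found = {}
    for name in ("config.json", "generation_config.json"):
        path = Path(model_dir) / name
        if not path.is_file():
            continue  # file not shipped with this checkpoint
        with path.open(encoding="utf-8") as fh:
            config = json.load(fh)
        # Keep only the top-level keys that control output length.
        found[name] = {key: config[key] for key in LENGTH_KEYS if key in config}
    return found
```

An empty dictionary for a file means it sets none of these keys, in which case generation falls back to the library defaults.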
Here is my code:

#!/usr/bin/python3
import sys

from transformers import MT5ForConditionalGeneration, MT5Tokenizer

device = "cuda:0"

# sys.argv[1] is concatenated directly, so it must end with a path separator
base_dir = sys.argv[1] + "persiannlp"
size = "base"
mname = f'{base_dir}/data/mt5-{size}-parsinlu-translation_en_fa'

tokenizer = MT5Tokenizer.from_pretrained(mname)
model = MT5ForConditionalGeneration.from_pretrained(mname)
model = model.to(device)

# Protocol: one sentence per line on stdin; 'EOD' translates the batch, 'EOF' exits
lines = []
for line in sys.stdin:
    line = line.strip()
    if line == 'EOD':
        inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
        # generate() is called with default settings; no length arguments are passed
        translated = model.generate(**inputs)
        for t in translated:
            print(tokenizer.decode(t, skip_special_tokens=True))
        print('EOL')
        sys.stdout.flush()
        lines.clear()
    elif line.startswith('EOF'):
        sys.exit(0)
    else:
        lines.append(line)
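
A likely explanation, not verified against this checkpoint: the script calls `generate()` without any length argument, and transformers then falls back to a default `max_length` of 20 tokens unless the model's config overrides it. That would cut the output off at roughly the length seen in the translations above. The sketch below uses a hypothetical helper, `length_budget`, to derive a `max_length` from the input length. The `ratio`, `floor`, and `cap` values are illustrative assumptions, not tuned settings.

```python
def length_budget(input_tokens, ratio=2.0, floor=32, cap=512):
    """Hypothetical helper: build generate() length kwargs from the input length.

    input_tokens: number of tokens in the longest (padded) input sequence.
    ratio, floor, cap: illustrative values, assumed rather than measured.
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    budget = int(input_tokens * ratio)
    # Clamp the budget to the range [floor, cap].
    return {"max_length": max(floor, min(budget, cap))}


# Usage with the script above (needs the loaded model and tokenizer):
#   n = inputs["input_ids"].shape[1]
#   translated = model.generate(**inputs, **length_budget(n))
```

Passing a fixed value such as `max_length=128` would also work. Scaling with the input only keeps short batches from reserving more decoding steps than they need.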
