Machine translation - long sentences cause incomplete translation

I'm translating English sentences into Farsi with mt5-base-parsinlu-translation_en_fa (from Hugging Face). For sentences longer than about 8 words, only the first part of the sentence is translated and the rest is ignored. For example:

English sentences:

Terry's side fell to their second Premier League loss of the season at Loftus Road

Following a four-day hiatus, UN envoy Ismail Ould Cheikh Ahmed on Thursday will resume mediation efforts in the second round of Kuwait-hosted peace talks between Yemen’s warring rivals.

Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.

Translations:

طرفدار تری در فوتبال دوم فصل در لئوپوس رود به 

پس از چهار روز توقف، سفیر سازمان ملل، ایمیل اولد شیخ 

مارک ولز نویسنده و پخش کننده ای است که بیش از یک دهه


According to Google Translate, these translate back to:

More fans in the second football season in Leopard

After a four-day hiatus, the ambassador to the United Nations, Old Sheikh Sheikh

Mark Wells has been a writer and broadcaster for over a decade


I can't find any configuration setting that would limit the number of tokens being translated.
Here is my code:

    #!/usr/bin/python3
    import sys

    from transformers import MT5ForConditionalGeneration, MT5Tokenizer

    device = "cuda:0"

    # The first argument is the path prefix of the locally stored model
    base_dir = sys.argv[1] + "persiannlp"
    size = "base"
    mname = f"{base_dir}/data/mt5-{size}-parsinlu-translation_en_fa"

    tokenizer = MT5Tokenizer.from_pretrained(mname)
    model = MT5ForConditionalGeneration.from_pretrained(mname)
    model = model.to(device)

    # Protocol: one sentence per line, 'EOD' translates the batch, 'EOF' exits
    lines = []
    for line in sys.stdin:
        line = line.strip()
        if line == "EOD":
            inputs = tokenizer(lines, return_tensors="pt", padding=True).to(device)
            # generate() is called with its defaults; no length options are set
            translated = model.generate(**inputs)
            for t in translated:
                print(tokenizer.decode(t, skip_special_tokens=True))
            print("EOL")
            sys.stdout.flush()
            lines.clear()
        elif line.startswith("EOF"):
            sys.exit(0)
        else:
            lines.append(line)
    sys.exit(0)
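
To rule out the input handling as the source of the truncation, the stdin protocol used by the script (sentences, then `EOD` to flush a batch, `EOF` to quit) can be exercised without loading the model. The following is a minimal sketch of that protocol only. `read_batches` is a hypothetical helper name that does not appear in the script, and it accepts any iterable of lines so it can be fed a list instead of `sys.stdin`.

```python
from typing import Iterable, Iterator, List


def read_batches(stream: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of sentences per 'EOD' line; stop at a line starting with 'EOF'."""
    batch: List[str] = []
    for raw in stream:
        line = raw.strip()
        if line == "EOD":
            # Hand the collected sentences to the caller and start a new batch
            yield batch
            batch = []
        elif line.startswith("EOF"):
            return
        else:
            batch.append(line)


# Feed two of the example sentences through the protocol, one batch each
demo_input = [
    "Terry's side fell to their second Premier League loss of the season at Loftus Road\n",
    "EOD\n",
    "Mark Woods is a writer and broadcaster who has covered the NBA, and British basketball, for over a decade.\n",
    "EOD\n",
    "EOF\n",
]
batches = list(read_batches(demo_input))
for batch in batches:
    for sentence in batch:
        # Show that each sentence reaches the batch whole, with its word count
        print(len(sentence.split()), sentence)
```

If the full sentences appear here, the complete text is reaching the tokenizer, so the cutoff would have to happen during tokenization or generation rather than while reading input.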

