Hi Yiyi,
Thank you for your valuable contribution!
I'm currently trying to reproduce your results for the Multilingual ME5-Base model, specifically using the t5_me5_base_mtg_en_fr_de_es_5m_32_corrector checkpoint. However, I'm running into a few issues, mainly around punctuation spacing in the decoded text, and I have some uncertainties about the evaluation setup.
Here's the code snippet I used:
# (imports of analyze_utils and trainer_attributes omitted)
model_path = "yiyic/t5_me5_base_mtg_en_fr_de_es_5m_32_corrector"
samples = [
    "jack morris is a phd student at cornell tech in new york city",
    "it was the best of times, it was the worst of times, it was the age of wisdom",
    "in einer stunde erreichen wir kopenhagen.",
]

# load the pretrained corrector experiment and trainer
experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    model_path, use_less_data=3000
)
trainer, device = trainer_attributes(trainer, experiment)
trainer.num_gen_recursive_steps = 10
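To get the [pred]/[true] pairs shown under "My output", I embed the samples with the embedder tokenizer and decode them through the corrector, roughly as sketched here (this assumes the vec2text-style trainer.generate(inputs, generation_kwargs) interface and the trainer.embedder_tokenizer / trainer.tokenizer attributes carried over from upstream; the generation settings are my own choice):

# sketch of how I decode the samples (assumes the vec2text-style trainer interface)
emb_inputs = trainer.embedder_tokenizer(
    samples, padding=True, return_tensors="pt"
).to(device)
gen_ids = trainer.generate(
    inputs={
        "embedder_input_ids": emb_inputs["input_ids"],
        "embedder_attention_mask": emb_inputs["attention_mask"],
    },
    generation_kwargs={"do_sample": False, "max_length": 64},
)
preds = trainer.tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
for pred, true in zip(preds, samples):
    print("[pred]", pred)
    print("[true]", true)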
My output:
[pred] jack morris is a phd student at cornell tech in new york city
[true] jack morris is a phd student at cornell tech in new york city
[pred] it was the best of times , it was the worst of times , es was the ages of wisdom , time
[true] it was the best of times, it was the worst of times, it was the age of wisdom
[pred] in einer stunde erreichen wir kopenhagen.
[true] in einer stunde erreichen wir kopenhagen.
To run the model without retraining, I commented out these lines in experiments.py and replaced the datasets with empty dictionaries in the return statement (roughly as sketched below). I hope this doesn't affect the evaluation logic, but please correct me if I'm wrong.
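For concreteness, my local edit amounts to something like this (paraphrased; the exact method and variable names in experiments.py differ, so treat the names below as approximate):

# paraphrase of my edit to the dataset-loading method in experiments.py (names approximate)
def load_train_and_val_datasets(self, *args, **kwargs):
    # train_datasets = self._load_train_dataset_uncached(...)   # commented out
    # val_datasets = self._load_val_datasets_uncached(...)      # commented out
    return {}, {}  # empty dicts in place of the loaded datasets, so loading the checkpoint still works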
Additionally, I have a couple of questions regarding your evaluation setup:
- Which dataset/split did you use for evaluation in the paper? You mention 2k rows in the paper, but the test splits of yiyic/mtg_de_5m and yiyic/mtg_en_5m on Hugging Face each contain 3k rows. Should I use the validation split (2k rows)? And are these the datasets you used to obtain your evaluation results?
- Is it necessary to truncate input to exactly 32 tokens during evaluation to match the results in the paper?
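If truncation is needed, I would apply it roughly like this before embedding (a sketch using the standard Hugging Face tokenizer arguments; choosing trainer.embedder_tokenizer for this step is my assumption):

# hypothetical 32-token truncation before embedding
enc = trainer.embedder_tokenizer(
    samples,
    max_length=32,            # matching the "_32" in the checkpoint name
    truncation=True,
    padding="max_length",
    return_tensors="pt",
).to(device)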
Using the mtg_de_5m validation split and evaluating for 20 steps, I get:
trainer get decoded sequences
[pred] hören sie, wie 75% der roten nottingham forest-fans diese entscheidung genießen sollten: derby sollte das
[true] aufgedeckt: 75% der derby-fans sprechen dieses nottingham forest-urteil aus. die roten sollten dies m
[pred] der brütige junge schauspieler sonny beyga entlassene sich aus der stadt in der kleinen geste
[true] der herzwärtige geste kleiner junge blauer schauspieler sonny beyga hielt sich geheim aus der
[pred] ex-cia-agenten fanden eines der u-boote von pablo escobar, während er seine unglaublich
[true] ex-cia-agenten fanden einen der u-boote von pablo escobar, während er nach seinem
evaluating....
shape : torch.Size([2000, 34])
shape after per_device: torch.Size([256, 34])
saving embeddings for preds and labels ....
{..., 'eval_true_num_tokens': 29.7421875, 'eval_pred_num_tokens': 30.7421875, ... , 'eval_bleu_score': 34.21, 'eval_rouge_score': 0.63, 'eval_exact_match': 0.079, 'eval_emb_cos_sim': 0.9667, ...}
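For reference, the numbers above come from an evaluation along the lines of the upstream vec2text recipe, roughly as sketched here (the _load_val_datasets_uncached call and the "validation" key are carried over from upstream vec2text, so they are assumptions about this fork; the batch size matches the per-device shape in the log):

# rough shape of the evaluation run (assumes the upstream vec2text recipe applies to this fork)
val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer,
)
trainer.args.per_device_eval_batch_size = 256
trainer.num_gen_recursive_steps = 20
metrics = trainer.evaluate(eval_dataset=val_datasets["validation"])
print(metrics)  # -> the dict above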
The metrics don't fully align with the numbers reported in the paper, and I suspect this is due to token truncation or a difference in the evaluation dataset. Is truncation to exactly 32 tokens required during evaluation, and is the exact dataset used in the paper publicly available?
Any insight you could share would be greatly appreciated!
