Hi Yiyi,
Thank you for your valuable contribution!
I'm currently trying to reproduce your results for the Multilingual ME5-Base model, specifically using the t5_me5_base_mtg_en_fr_de_es_5m_32_corrector checkpoint. However, I'm running into a few issues, mainly around punctuation spacing in the decoded text, and I have some uncertainties about the evaluation setup.
Here's the code snippet I used:
# (imports of analyze_utils and trainer_attributes omitted)
model_path = "yiyic/t5_me5_base_mtg_en_fr_de_es_5m_32_corrector"
samples = [
    "jack morris is a phd student at cornell tech in new york city",
    "it was the best of times, it was the worst of times, it was the age of wisdom",
    "in einer stunde erreichen wir kopenhagen.",
]

# load the pretrained corrector experiment and trainer
experiment, trainer = analyze_utils.load_experiment_and_trainer_from_pretrained(
    model_path, use_less_data=3000
)
trainer, device = trainer_attributes(trainer, experiment)
trainer.num_gen_recursive_steps = 10
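To get the [pred]/[true] pairs shown under "My output", I embed the samples with the embedder tokenizer and decode them through the corrector, roughly as sketched here (this assumes the vec2text-style trainer.generate(inputs, generation_kwargs) interface and the trainer.embedder_tokenizer / trainer.tokenizer attributes carried over from upstream; the generation settings are my own choice):

# sketch of how I decode the samples (assumes the vec2text-style trainer interface)
emb_inputs = trainer.embedder_tokenizer(
    samples, padding=True, return_tensors="pt"
).to(device)
gen_ids = trainer.generate(
    inputs={
        "embedder_input_ids": emb_inputs["input_ids"],
        "embedder_attention_mask": emb_inputs["attention_mask"],
    },
    generation_kwargs={"do_sample": False, "max_length": 64},
)
preds = trainer.tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
for pred, true in zip(preds, samples):
    print("[pred]", pred)
    print("[true]", true)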
My output:
[pred] jack morris is a phd student at cornell tech in new york city
[true] jack morris is a phd student at cornell tech in new york city
[pred] it was the best of times , it was the worst of times , es was the ages of wisdom , time
[true] it was the best of times, it was the worst of times, it was the age of wisdom
[pred] in einer stunde erreichen wir kopenhagen.
[true] in einer stunde erreichen wir kopenhagen.
To run the model without retraining, I commented out these lines in experiments.py and replaced the datasets with empty dictionaries in the return statement (roughly as sketched below). I hope this doesn't affect the evaluation logic, but please correct me if I'm wrong.
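For concreteness, my local edit amounts to something like this (paraphrased; the exact method and variable names in experiments.py differ, so treat the names below as approximate):

# paraphrase of my edit to the dataset-loading method in experiments.py (names approximate)
def load_train_and_val_datasets(self, *args, **kwargs):
    # train_datasets = self._load_train_dataset_uncached(...)   # commented out
    # val_datasets = self._load_val_datasets_uncached(...)      # commented out
    return {}, {}  # empty dicts in place of the loaded datasets, so loading the checkpoint still works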
Additionally, I have a couple of questions regarding your evaluation setup:
- Which dataset/split did you use for evaluation in the paper? You mention 2k rows in the paper, but the test splits of yiyic/mtg_de_5m and yiyic/mtg_en_5m on Hugging Face each contain 3k rows. Should I use the validation split (2k rows)? And are these the datasets you used to obtain your evaluation results?
- Is it necessary to truncate input to exactly 32 tokens during evaluation to match the results in the paper?
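If truncation is needed, I would apply it roughly like this before embedding (a sketch using the standard Hugging Face tokenizer arguments; choosing trainer.embedder_tokenizer for this step is my assumption):

# hypothetical 32-token truncation before embedding
enc = trainer.embedder_tokenizer(
    samples,
    max_length=32,            # matching the "_32" in the checkpoint name
    truncation=True,
    padding="max_length",
    return_tensors="pt",
).to(device)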
Using the mtg_de_5m validation split and evaluating for 20 steps, I get:
trainer get decoded sequences
[pred] hören sie, wie 75% der roten nottingham forest-fans diese entscheidung genießen sollten: derby sollte das
[true] aufgedeckt: 75% der derby-fans sprechen dieses nottingham forest-urteil aus. die roten sollten dies m
[pred] der brütige junge schauspieler sonny beyga entlassene sich aus der stadt in der kleinen geste
[true] der herzwärtige geste kleiner junge blauer schauspieler sonny beyga hielt sich geheim aus der
[pred] ex-cia-agenten fanden eines der u-boote von pablo escobar, während er seine unglaublich
[true] ex-cia-agenten fanden einen der u-boote von pablo escobar, während er nach seinem
evaluating....
shape : torch.Size([2000, 34])
shape after per_device: torch.Size([256, 34])
saving embeddings for preds and labels ....
{..., 'eval_true_num_tokens': 29.7421875, 'eval_pred_num_tokens': 30.7421875, ... , 'eval_bleu_score': 34.21, 'eval_rouge_score': 0.63, 'eval_exact_match': 0.079, 'eval_emb_cos_sim': 0.9667, ...}
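For reference, the numbers above come from an evaluation along the lines of the upstream vec2text recipe, roughly as sketched here (the _load_val_datasets_uncached call and the "validation" key are carried over from upstream vec2text, so they are assumptions about this fork; the batch size matches the per-device shape in the log):

# rough shape of the evaluation run (assumes the upstream vec2text recipe applies to this fork)
val_datasets = experiment._load_val_datasets_uncached(
    model=trainer.model,
    tokenizer=trainer.tokenizer,
    embedder_tokenizer=trainer.embedder_tokenizer,
)
trainer.args.per_device_eval_batch_size = 256
trainer.num_gen_recursive_steps = 20
metrics = trainer.evaluate(eval_dataset=val_datasets["validation"])
print(metrics)  # -> the dict above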
The metrics don't fully align with the numbers reported in the paper, and I suspect this is due to token truncation or a difference in the evaluation dataset. Is truncation to exactly 32 tokens required during evaluation, and is the exact dataset used in the paper publicly available?
Any insight you could share would be greatly appreciated!
