
Conversation

@l-k-11235 (Contributor) commented Jan 17, 2025

This PR adapts the ScoringPreparator for the LM architecture and fixes the ignore_prompt method of the loss for LM validation with left padding.
We noticed that the filtertoolong transform should not be used along with the huggingface_tokenize transform. See
#191

kwargs = {}
string = string.replace(DefaultTokens.SEP, "\n").replace(
    DefaultTokens.MASK_BEFORE, self.tokenizers[side].pad_token
)
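The replacement in the snippet above can be illustrated with a self-contained sketch. The token values and the pad token below are hypothetical placeholders, not eole's actual DefaultTokens values:

```python
# Hypothetical stand-ins for the identifiers in the snippet above; the real
# values come from eole's DefaultTokens and the HF tokenizer's pad_token.
class DefaultTokens:
    SEP = "<sep>"
    MASK_BEFORE = "<mask_before>"

pad_token = "<pad>"  # stand-in for self.tokenizers[side].pad_token

string = "line one" + DefaultTokens.SEP + "line two " + DefaultTokens.MASK_BEFORE
# Map the internal separator back to a newline and the mask marker to
# the tokenizer's pad token, as done in the snippet under review.
string = string.replace(DefaultTokens.SEP, "\n").replace(
    DefaultTokens.MASK_BEFORE, pad_token
)
print(string)
```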
Contributor

Shouldn't we handle it the same way for the other tokenizers?

Contributor Author

I have only tested this one (with EuroLLM) at this point; an error will be raised with the others.

Member

@francoishernandez francoishernandez left a comment

It seems a bit odd to have such very specific changes in scoring_utils (mostly the response_patterns handling), but I'm not sure how this can be properly factorized with the current structure. Let's keep it for now; we may reconsider later.

Comment on lines +80 to +103
if is_seq2seq:
    predictor = Translator.from_config(  # we need to review opt/config stuff in translator
        model,
        self.vocabs,
        predict_config,
        model_config,
        device_id=gpu_rank,
        global_scorer=scorer,
        report_align=predict_config.report_align,
        report_score=False,
        logger=None,
    )
else:
    predictor = GeneratorLM.from_config(
        model,
        self.vocabs,
        predict_config,
        model_config,
        device_id=gpu_rank,
        global_scorer=scorer,
        report_align=predict_config.report_align,
        report_score=False,
        logger=None,
    )
Member

Maybe cleaner to just define a predictor_class in the condition, and call predictor_class(*) once, since they should have the same signature.
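The refactor suggested above could look like the following sketch. The classes here are dummy stand-ins (the real Translator and GeneratorLM are imported in scoring_utils), and build_predictor is a hypothetical wrapper added only for illustration:

```python
# Dummy stand-ins with the shared from_config signature assumed by the review.
class Translator:
    @classmethod
    def from_config(cls, model, *args, **kwargs):
        return cls()

class GeneratorLM:
    @classmethod
    def from_config(cls, model, *args, **kwargs):
        return cls()

def build_predictor(model, is_seq2seq, **kwargs):
    # Choose the class in the condition, then call from_config exactly once,
    # since both constructors take the same arguments.
    predictor_class = Translator if is_seq2seq else GeneratorLM
    return predictor_class.from_config(model, **kwargs)
```

This collapses the duplicated argument list into a single call site, so a future signature change only needs to be made in one place.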
