Conversation

@sfluegel05
Collaborator

I added a new parameter to the base dataset. If this parameter is set, all instances with more than n_token_limit tokens are removed from the dataset after tokenisation. This makes it possible to train models with max_position_embeddings=n_token_limit+1. For instance, in ChEBI, 99% of instances have fewer than 300 SMILES tokens. Using that as the limit reduces max_position_embeddings from 1800 (the current default) to 301, which is useful for training runs since it allows a higher batch size and more efficient memory usage. For "production models", I would recommend a higher limit (e.g. 600 or 900). A minimal sketch of the filtering idea is shown below.
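
A minimal sketch of the filtering step described above, not the actual implementation: it assumes the tokenised instances store their token ids under a hypothetical "features" key and that the parameter is named n_token_limit.

```python
from typing import Iterable, Optional


def filter_by_token_limit(tokenised_instances: Iterable[dict],
                          n_token_limit: Optional[int] = None) -> list[dict]:
    """Drop every tokenised instance whose token count exceeds n_token_limit.

    tokenised_instances: iterable of dicts, each holding the token ids of one
        SMILES string under "features" (assumed structure for this sketch).
    n_token_limit: if None, no filtering is applied.
    """
    if n_token_limit is None:
        return list(tokenised_instances)
    return [
        instance
        for instance in tokenised_instances
        if len(instance["features"]) <= n_token_limit
    ]


# Example: with n_token_limit=300, the model can be configured with
# max_position_embeddings=301 (one extra position, e.g. for a [CLS] token).
```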

@sfluegel05 merged commit 1323a18 into dev Apr 7, 2025
6 checks passed
@sfluegel05 deleted the feature/maxlen-dataset branch April 7, 2025 09:18