Add `n_token_limit` to dataset #79

sfluegel05 · 2025-04-07T08:50:51Z

I added a new parameter, `n_token_limit`, to the base dataset. If it is set, all instances with more than `n_token_limit` tokens are removed from the dataset after tokenisation. This makes it possible to train models with `max_position_embeddings=n_token_limit+1`. For instance, in ChEBI, 99% of instances have fewer than 300 SMILES tokens. Using 300 as the limit reduces `max_position_embeddings` from 1800 (the current default) to 301. This is useful for training runs, as it allows a higher batch size and more efficient memory usage. For "production models", I would recommend a higher limit (e.g. 600 or 900).
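To illustrate the behaviour, here is a minimal sketch of the filtering step. It is not the code from this PR: the function names, the `features` key, and the list-of-dicts layout are assumptions made for the example, and only the `n_token_limit` parameter and the `n_token_limit+1` relation come from the description above.

```python
from typing import Any, Dict, List, Optional

Instance = Dict[str, Any]


def filter_by_token_limit(
    instances: List[Instance],
    n_token_limit: Optional[int] = None,
    key: str = "features",  # hypothetical name of the field holding the token ids
) -> List[Instance]:
    """Drop every instance that has more than n_token_limit tokens.

    If n_token_limit is None, the dataset is returned unchanged.
    """
    if n_token_limit is None:
        return list(instances)
    return [row for row in instances if len(row[key]) <= n_token_limit]


def required_max_position_embeddings(n_token_limit: int) -> int:
    """Smallest max_position_embeddings for a given limit, per the description above."""
    return n_token_limit + 1


# Toy data: token id lists of length 3, 5 and 2.
data = [
    {"ident": "a", "features": [11, 12, 13]},
    {"ident": "b", "features": [11, 12, 13, 14, 15]},
    {"ident": "c", "features": [11, 12]},
]
kept = filter_by_token_limit(data, n_token_limit=3)
print([row["ident"] for row in kept])
print(required_max_position_embeddings(300))
```

With a limit of 3, the five-token instance is dropped and the other two are kept. An instance with exactly `n_token_limit` tokens stays in the dataset, since only instances with more tokens are removed.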

sfluegel05 added 2 commits April 3, 2025 16:15

add a parameter to the dataset that (if set), throws out all instances that have more than x tokens

5870948

reformat using black

0a62a95

sfluegel05 merged commit 1323a18 into dev Apr 7, 2025
6 checks passed

sfluegel05 deleted the feature/maxlen-dataset branch April 7, 2025 09:18
