add smiles canonicalisation, update tokens.txt #118

sfluegel05 · 2025-08-01T12:05:38Z

This solves #117

In the `_read_data` method of `ChemDataReader` (which is used by all SMILES-related datasets), each molecule is now read with RDKit and written back out as a canonical SMILES string.
Canonicalisation can be disabled by setting `reader_kwargs["canonicalize_smiles"]` to `False` in the dataset.
This PR also adds new tokens, since many tokens in canonical SMILES don't appear in ChEBI (mostly because of notation changes such as `--` becoming `-2`).
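
A minimal sketch of the canonicalisation step, not the actual `_read_data` code: the helper name `canonicalize` and its `enabled` parameter are hypothetical, and only `Chem.MolFromSmiles` and `Chem.MolToSmiles` are real RDKit calls. It also assumes that unparseable SMILES are passed through unchanged, which this PR does not specify.

```python
def canonicalize(smiles: str, enabled: bool = True) -> str:
    """Return RDKit's canonical SMILES for `smiles`, or the input unchanged."""
    if not enabled:
        # Corresponds to reader_kwargs["canonicalize_smiles"] = False.
        return smiles

    # Imported lazily so the disabled path does not require RDKit.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # returns None if parsing fails
    if mol is None:
        # Assumption: keep the original string rather than dropping the entry.
        return smiles
    return Chem.MolToSmiles(mol)  # canonical output is the default
```

With this, two spellings of the same molecule such as `OCC` and `C(O)C` map to the same string, and a charge written as `[O--]` comes back as `[O-2]`, which is the kind of change that makes the extra tokens necessary.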

note for #113: SMILES augmentation only works if canonicalisation is set to False
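
One way this constraint could be enforced automatically, sketched under assumptions: the function `resolve_reader_kwargs` and its `augment` argument are hypothetical names, and only the `canonicalize_smiles` key comes from this PR.

```python
from typing import Optional


def resolve_reader_kwargs(reader_kwargs: Optional[dict], augment: bool) -> dict:
    """Return a copy of reader_kwargs that is safe to combine with augmentation."""
    kwargs = dict(reader_kwargs or {})
    if augment:
        # Canonicalisation would collapse every augmented SMILES variant back
        # to one string, so it is switched off when augmentation is requested.
        kwargs["canonicalize_smiles"] = False
    return kwargs
```

The input dictionary is copied rather than modified, so a shared default configuration is not changed as a side effect.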

aditya0by0 · 2025-08-09T10:47:53Z

> note for #113: SMILES augmentation only works if canonicalisation is set to False

Sure, I will take this into consideration.

#118 (comment)

sfluegel05 added 2 commits August 1, 2025 13:36

add smiles canonicalisation, update tokens.txt

04fd197

add canonicalize flag

e85a9c1

sfluegel05 marked this pull request as ready for review August 1, 2025 12:09

sfluegel05 merged commit ac8cf63 into dev Aug 1, 2025
9 checks passed

aditya0by0 added a commit that referenced this pull request Aug 9, 2025

if aug is true, set reader's canoncialize as False

f127b5e

#118 (comment)

sfluegel05 deleted the feature/canonicalise-smiles branch August 11, 2025 11:53

sfluegel05 mentioned this pull request Nov 17, 2025

SMILES should be canonical (unless augmentation is used or other format is specifically requested) #117

Closed

3 participants