@sfluegel05
Collaborator

This solves #117

  • In the _read_data method of ChemDataReader (used by all SMILES-related datasets), each molecule is now parsed with RDKit and written back out as a new, canonical SMILES string.
  • Canonicalisation can be disabled by setting reader_kwargs["canonicalize_smiles"] to False in the dataset.
  • This PR also adds new tokens, since many of the tokens in canonical SMILES don't appear in ChEBI (mostly because of notation changes such as -- becoming -2).
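
For readers who have not looked at the diff, the behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name canonicalize_smiles and the fallback for unparsable input are assumptions, and only Chem.MolFromSmiles and Chem.MolToSmiles are real RDKit calls.

```python
def canonicalize_smiles(smiles: str, canonicalize: bool = True) -> str:
    """Hypothetical helper: return RDKit's canonical SMILES for `smiles`.

    With canonicalize=False the input is passed through unchanged, which
    mirrors setting reader_kwargs["canonicalize_smiles"] to False.
    """
    if not canonicalize:
        return smiles
    # Imported lazily so the pass-through path does not need RDKit.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)  # returns None if parsing fails
    if mol is None:
        # Assumption: keep the original string when RDKit cannot parse it.
        return smiles
    return Chem.MolToSmiles(mol)  # canonical output is RDKit's default


# Disabling canonicalisation through the reader kwargs named in this PR.
reader_kwargs = {"canonicalize_smiles": False}
raw = canonicalize_smiles("OCC", canonicalize=reader_kwargs["canonicalize_smiles"])
```

With canonicalisation enabled, different spellings of the same molecule (for example "OCC" and "CCO") map to one string, and charges come out in RDKit's own notation, which is why the token list had to grow.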

note for #113: SMILES augmentation only works if canonicalisation is set to False
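
One likely reason for this note, offered as an interpretation rather than something stated in the PR: augmentation produces several different SMILES strings for one molecule, and canonicalising them afterwards maps them all back to a single string. The sketch below shows that effect with RDKit; both function names are hypothetical and are not taken from the repository.

```python
def randomised_variants(smiles: str, n: int = 5) -> list:
    """Hypothetical helper: n randomised (non-canonical) SMILES for one molecule."""
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    # doRandom=True asks RDKit for a random atom ordering on each call.
    return [Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n)]


def distinct_after_canonicalisation(variants: list) -> int:
    """Count how many distinct strings remain once each variant is canonicalised."""
    from rdkit import Chem

    return len({Chem.MolToSmiles(Chem.MolFromSmiles(v)) for v in variants})
```

For any single molecule, distinct_after_canonicalisation(randomised_variants(...)) comes out as 1, so the augmented variants would be lost if canonicalisation stayed on.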

sfluegel05 marked this pull request as ready for review August 1, 2025 12:09
sfluegel05 merged commit ac8cf63 into dev Aug 1, 2025
9 checks passed
@aditya0by0
Member

aditya0by0 commented Aug 9, 2025

note for #113: SMILES augmentation only works if canonicalisation is set to False

Sure, will take this into consideration.
