Avoid generation of original SMILES in augmentation #136

aditya0by0 · 2025-12-06T12:21:09Z

See Wiki https://github.com/ChEB-AI/python-chebai/wiki/SMILES-Augmentation#snippet-of-augmented-smiles

sfluegel05

That makes sense. I am wondering why we need the original SMILES at all. Why don't we only do the random SMILES generation (which then might include the original SMILES, but not necessarily)?

aditya0by0 · 2025-12-08T19:31:18Z

The motivation for including the original SMILES strings in the augmented dataset comes from two considerations:

1. Analogy to data augmentation in computer vision
   In the vision domain, it is standard practice to include both the original images and their augmented variants (rotations, flips, scaling, color jitter, etc.) when training CNN models. The original samples provide a stable reference distribution, while augmented samples improve robustness.
   By analogy, including the original SMILES ensures that the model stays exposed to the true data distribution of ChEBI, rather than relying solely on augmented variants.
2. Chemical-domain constraints on SMILES generation
   A more domain-specific reason is that some SMILES strings present in ChEBI can be parsed by RDKit but cannot be regenerated by RDKit's SMILES writing algorithms.
   This happens because RDKit normalizes certain structural representations when it reads and writes molecules.
   For example, RDKit typically folds explicitly written hydrogens into implicit hydrogen counts when parsing, and normalizes other parts of the representation when writing SMILES. As a result, ChEBI SMILES that spell out hydrogens explicitly or use uncommon notations will not appear in the augmented outputs generated by RDKit.
   Therefore, these original SMILES must be included in the training set: they are the notations actually found in ChEBI, and they would otherwise be lost during augmentation.
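
The round-trip behaviour described in the second point can be checked on a small molecule. This is a minimal sketch, not code from the repository: methanol written with explicit hydrogens is an assumed stand-in for a ChEBI entry, and `random_smiles` is a hypothetical helper name.

```python
from rdkit import Chem


def random_smiles(smiles: str, n: int = 50) -> set[str]:
    """Parse `smiles` and collect up to `n` randomized SMILES written by RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse {smiles!r}")
    return {Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)}


# Methanol with every hydrogen written as an explicit atom (illustrative input).
original = "[H]C([H])([H])O[H]"
variants = random_smiles(original)

print(variants)
print(original in variants)  # the explicit-hydrogen notation is not reproduced
```

With default parsing the explicit hydrogens are dropped from the molecule graph, so none of the generated strings can contain `[H]`, however many variants are drawn.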

Below is an example illustrating such a case:

| ident | name | SMILES |
| --- | --- | --- |
| 32129 | diamminesilver(1+) fluoride | `[F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]` |

Its generated SMILES are listed here: https://github.com/ChEB-AI/python-chebai/wiki/SMILES-Augmentation#snippet-of-augmented-smiles

The program below supports this explanation:

import random
from itertools import cycle, permutations, product

from rdkit import Chem

# Number of random SMILES drawn per fragment (duplicates collapse in the set).
# Lower this value for a quick run.
AUG_SMILES_VARIATIONS = 1_000_000_000


def generate_augmented_smiles(smiles: str) -> set[str]:
    """Generate random SMILES and report whether the original is among them."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {smiles}  # unparsable input: fall back to the original SMILES

    # sanitizeFrags=False: sanitization can alter the fragment representation.
    # We don't want RDKit to "fix" fragments; we only need them as-is to
    # generate SMILES strings.
    frags = Chem.GetMolFrags(mol, asMols=True, sanitizeFrags=False)

    frag_smiles: list[set[str]] = []
    for frag in frags:
        atom_ids = [atom.GetIdx() for atom in frag.GetAtoms()]
        random.shuffle(atom_ids)  # seed set by lightning
        atom_id_iter = cycle(atom_ids)
        # Root each random SMILES at the next atom of the shuffled order.
        frag_smiles.append(
            {
                Chem.MolToSmiles(frag, rootedAtAtom=next(atom_id_iter), doRandom=True)
                for _ in range(AUG_SMILES_VARIATIONS)
            }
        )

    if len(frags) > 1:
        # Join the fragments in every order, using every combination of
        # per-fragment SMILES.
        augmented = {
            ".".join(combo)
            for perm in permutations(frag_smiles)
            for combo in product(*perm)
        }
    else:
        augmented = frag_smiles[0]

    # Report the result exactly once.
    if smiles in augmented:
        print("Found original SMILES in augmented set.")
    else:
        print("Original SMILES NOT found in augmented set.")
    return augmented


if __name__ == "__main__":
    test_smiles = "[F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]"
    generate_augmented_smiles(test_smiles)
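
One way to read the PR title is that the original SMILES is kept exactly once and any augmented variant identical to it is dropped. The sketch below illustrates that reading only: `merge_with_original` is a hypothetical helper, not a function from python-chebai, and the input strings are made up for the example.

```python
def merge_with_original(original: str, augmented: list[str]) -> list[str]:
    """Return the original SMILES once, followed by the unique augmented variants."""
    seen = {original}
    merged = [original]
    for candidate in augmented:
        if candidate not in seen:  # skips repeats and copies of the original
            seen.add(candidate)
            merged.append(candidate)
    return merged


# Illustrative variants; "CO" duplicates the original and "OC" appears twice.
variants = ["OC", "CO", "OC", "C(O)"]
print(merge_with_original("CO", variants))  # ['CO', 'OC', 'C(O)']
```

Keeping the original at the front makes it easy to tell the ChEBI notation apart from the generated ones downstream, which is an assumption of this sketch rather than a stated requirement.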

aditya0by0 added 2 commits December 6, 2025 12:54

avoid generation of original smiles in augmentation

2218fc8

pre-commit formatting

df02f55

aditya0by0 self-assigned this Dec 6, 2025

aditya0by0 added bug bug:fix and removed bug labels Dec 6, 2025

aditya0by0 linked an issue Dec 6, 2025 that may be closed by this pull request

SMILES augmentation #113

Closed

aditya0by0 requested a review from sfluegel05 December 6, 2025 12:24

sfluegel05 reviewed Dec 8, 2025

aditya0by0 requested a review from sfluegel05 December 8, 2025 19:32
