@aditya0by0
Member

@aditya0by0 aditya0by0 self-assigned this Dec 6, 2025
@aditya0by0 aditya0by0 added bug Something isn't working bug:fix and removed bug Something isn't working labels Dec 6, 2025
@aditya0by0 aditya0by0 linked an issue Dec 6, 2025 that may be closed by this pull request
@aditya0by0 aditya0by0 requested a review from sfluegel05 December 6, 2025 12:24
Collaborator

@sfluegel05 sfluegel05 left a comment

That makes sense. I am wondering why we need the original SMILES at all. Why don't we only do the random SMILES generation (which then might include the original SMILES, but not necessarily)?

@aditya0by0
Member Author

The motivation for including the original SMILES strings in the augmented dataset comes from two considerations:

  1. Analogy to data augmentation in Computer Vision
    In the vision domain, it is standard practice to include both the original images and their augmented variants (rotations, flips, scaling, color jitter, etc.) when training CNN models. The original samples provide a stable reference distribution, while augmented samples improve robustness.
    By analogy, including the original SMILES ensures that the model stays exposed to the true data distribution of ChEBI, rather than relying solely on augmented variants.

  2. Chemical-domain constraints on SMILES generation
    A more domain-specific reason is that some SMILES strings present in ChEBI can be parsed by RDKit but are never reproduced by RDKit’s SMILES writing algorithms.
    This happens because RDKit normalizes parts of the representation when it parses and writes a molecule.
    For example, hydrogens written as explicit atoms (such as [H][N]([H])([H])...) are by default merged into their neighbouring heavy atoms during parsing, so the writer no longer emits them as separate [H] atoms. As a result, SMILES in ChEBI that spell out hydrogens this way, or that use other uncommon notations, will not appear in the augmented outputs generated by RDKit.
    Therefore, these original SMILES must be added to the training set explicitly: they are valid representations found in ChEBI that would otherwise be lost during augmentation.

Below is an example illustrating such a case:

| ident | name | SMILES |
|-------|------|--------|
| 32129 | diamminesilver(1+) fluoride | [F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H] |

The SMILES strings generated for this entry are listed here: https://github.com/ChEB-AI/python-chebai/wiki/SMILES-Augmentation#snippet-of-augmented-smiles
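
For dataset construction, the practical consequence is that the original string is kept unconditionally and the augmented variants are added on top of it. A minimal sketch of that idea follows. It is not the code in this PR: `build_training_smiles` and `fake_augment` are hypothetical names, and the augmenter is a hard-coded stand-in rather than RDKit.

```python
from typing import Callable, Iterable


def build_training_smiles(
    original: str,
    augment: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return the original SMILES first, followed by its distinct augmented variants."""
    result = [original]
    seen = {original}
    for variant in augment(original):
        if variant not in seen:  # drop duplicates, including a re-generated original
            seen.add(variant)
            result.append(variant)
    return result


def fake_augment(smiles: str) -> list[str]:
    """Hypothetical stand-in for an augmenter; it ignores its input and returns
    fixed ethanol spellings (with one duplicate), none equal to "C(C)O"."""
    return ["CCO", "OCC", "CCO"]


training_smiles = build_training_smiles("C(C)O", fake_augment)
print(training_smiles)
```

Here the original spelling `C(C)O` stays in the output even though the stand-in augmenter never produces it, which is the situation described above for some ChEBI entries.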

The program below checks whether RDKit can regenerate the original SMILES for this entry:

import random
from itertools import cycle, permutations, product

from rdkit import Chem

# Number of random SMILES drawn per fragment. Deliberately huge, so that a miss
# is unlikely to be a sampling accident; reduce it for a quick check.
AUG_SMILES_VARIATIONS = 1_000_000_000


def generate_augmented_smiles(smiles: str) -> list[str]:
    mol: Chem.Mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return [smiles]  # if mol is None, return original SMILES

    # sanitizeFrags=False: we don't want RDKit to "fix" the fragments,
    # we only need them as-is to generate SMILES strings.
    frags = Chem.GetMolFrags(mol, asMols=True, sanitizeFrags=False)

    frag_smiles: list[set[str]] = []
    for frag in frags:
        atom_ids = [atom.GetIdx() for atom in frag.GetAtoms()]
        random.shuffle(atom_ids)  # seed set by lightning
        atom_id_iter = cycle(atom_ids)  # rotate the root atom across draws
        frag_smiles.append(
            {
                Chem.MolToSmiles(frag, rootedAtAtom=next(atom_id_iter), doRandom=True)
                for _ in range(AUG_SMILES_VARIATIONS)
            }
        )

    if len(frags) > 1:
        # Join one variant per fragment with ".", over every fragment order.
        augmented = {
            ".".join(combo)
            for perm in permutations(frag_smiles)
            for combo in product(*perm)
        }
    else:
        augmented = frag_smiles[0]

    # Report once, after the full set has been built.
    if smiles in augmented:
        print("Found original SMILES in augmented set.")
    else:
        print("Original SMILES NOT found in augmented set.")
    return list(augmented)


if __name__ == "__main__":
    test_smiles = "[F-].[H][N]([H])([H])[Ag+][N]([H])([H])[H]"
    generate_augmented_smiles(test_smiles)
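
The fragment-combination step of the program above can be looked at in isolation with plain `itertools`. The sketch below assumes the per-fragment variants are already available as sets of strings; `combine_fragment_smiles` is a hypothetical helper name, and the variants are hard-coded sodium ethoxide spellings for illustration, not RDKit output.

```python
from itertools import permutations, product


def combine_fragment_smiles(frag_smiles: list[set[str]]) -> set[str]:
    """Join one variant per fragment with '.', over every fragment order."""
    combined: set[str] = set()
    for perm in permutations(frag_smiles):  # every ordering of the fragments
        for combo in product(*perm):  # one variant picked from each fragment
            combined.add(".".join(combo))
    return combined


# Hypothetical per-fragment variants (hard-coded, not generated by RDKit).
variants = [{"CC[O-]", "[O-]CC"}, {"[Na+]"}]
combined = combine_fragment_smiles(variants)
for s in sorted(combined):
    print(s)
```

With n fragments the result can hold up to n! times the product of the per-fragment set sizes, so the combined set grows quickly for multi-fragment entries.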

@aditya0by0 aditya0by0 requested a review from sfluegel05 December 8, 2025 19:32
Successfully merging this pull request may close these issues.

SMILES augmentation
