Idea
Many different SMILES can describe the same molecule. Which SMILES you get depends on several choices: Where does the traversal start? In which direction are rings traversed? How are branches ordered?
So far, we only use the SMILES provided by ChEBI. However, we could obtain additional SMILES for each ChEBI molecule by applying SMILES augmentation, i.e., adding several SMILES per molecule to the data set. This can potentially reduce overfitting and help generalisation.
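As a small illustration of the point about equivalent SMILES, the sketch below uses rdkit to check that three hand-written strings for ethanol all reduce to one canonical form. The example strings and the helper name `canonical_forms` are chosen for illustration only and are not taken from chebai or PR #52.

```python
from rdkit import Chem


def canonical_forms(smiles_list):
    """Map each SMILES to rdkit's canonical SMILES (hypothetical helper)."""
    forms = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # Skip strings rdkit cannot parse.
            continue
        forms.add(Chem.MolToSmiles(mol))
    return forms


# Three different traversals of ethanol: start at a carbon, start at the
# oxygen, or write the oxygen as a branch.
ethanol_variants = ["CCO", "OCC", "C(O)C"]
forms = canonical_forms(ethanol_variants)
print(len(forms))  # a single canonical form means all strings are the same molecule
```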
Current state
In PR #52, SMILES augmentation has already been implemented and shown to be successful (see wandb). However, PR #52 is outdated.
Todo
- On a new branch, take the relevant features from Feature/data augmentation #52 and add them to the current chebai
- Use rdkit to generate new SMILES with `rootedAtAtom` and `doRandom`: rdkit -> MolToSmiles
- The user should be able to specify the number of different SMILES per molecule (as a maximum value, not all molecules might be able to reach that maximum)
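
The last two items could look roughly like the sketch below. It is not the code from PR #52: the function name `augment_smiles` and the parameters `max_variants` and `max_attempts` are hypothetical, and keeping the original SMILES as the first entry (so that it counts towards the maximum) is an assumption. Only `Chem.MolFromSmiles` and `Chem.MolToSmiles` with `rootedAtAtom`, `doRandom` and `canonical` are actual rdkit calls.

```python
from rdkit import Chem


def augment_smiles(smiles, max_variants=5, max_attempts=50):
    """Return up to `max_variants` distinct SMILES for one molecule.

    Hypothetical helper: the input SMILES is kept as the first entry.
    Small molecules may yield fewer than `max_variants` strings.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None or mol.GetNumAtoms() == 0:
        # Unparseable or empty input: nothing to augment.
        return []

    n_atoms = mol.GetNumAtoms()
    variants = [smiles]
    seen = {smiles}

    for attempt in range(max_attempts):
        if len(variants) >= max_variants:
            break
        # Cycle through the atoms as traversal roots and let rdkit
        # randomise the branch and ring traversal order.
        candidate = Chem.MolToSmiles(
            mol,
            rootedAtAtom=attempt % n_atoms,
            doRandom=True,
            canonical=False,
        )
        if candidate not in seen:
            seen.add(candidate)
            variants.append(candidate)

    return variants


# Example: ask for at most 5 SMILES for aspirin.
for smi in augment_smiles("CC(=O)Oc1ccccc1C(=O)O", max_variants=5):
    print(smi)
```

The `max_attempts` bound is there because `doRandom` can return the same string repeatedly; without it, molecules with few possible traversals (e.g. methane) would never reach the requested maximum and the loop would not terminate.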