Computation of penalized logP in the ZINC dataset. #9004
KemperNiklas asked this question in Q&A (status: Unanswered)
Replies: 1 comment
-
Thanks for this insightful issue. The description and the source of the dataset are taken from https://arxiv.org/pdf/2003.00982v5.pdf. How do you suggest that we update the documentation?
-
The pytorch_geometric documentation and many other sources say that the plogp score in ZINC is computed as:
plogp = logP - SAS - cycles
where `cycles` is a penalty for cycles larger than 6.

While I couldn't find the original source that computes plogp for the ZINC dataset, it seems like the following code was used:
At least this gives comparable results to the ground truth in the pyg dataset.
This differs from the documented formula: the cycle penalty is computed from the largest cycle of a cycle basis rather than from the rings of the molecule themselves. That leads to odd examples like the following (I colored the cycles of the cycle basis that are used in the reward_penalized_log_p function):
plogp.pdf
Molecule b is from the ZINC dataset and the cycle penalty is computed based on the 12-cycle, which results in a huge normalized penalty of -21.
Molecule a is hand-crafted. Here, the cycle basis consists of just three 6-cycles, which results in a normalized penalty of 0.2 despite a cyclic structure similar to that of molecule b. Which cycle basis is used depends on the implementation of the networkx (nx) cycle basis function.
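The two normalized values quoted above (-21 and 0.2) can be reproduced with a short calculation. This is only a sketch under assumptions: the raw cycle score is taken to be `-(largest_cycle - 6)` when the largest basis cycle has more than 6 atoms and 0 otherwise, and `CYCLE_MEAN` and `CYCLE_STD` are rounded normalization constants of the kind usually quoted alongside `reward_penalized_log_p`. Neither the constants nor the helper name `normalized_cycle_score` appear in this post, so treat both as assumptions.

```python
# Assumed, rounded normalization constants for the cycle term.
# They are NOT stated in this post and are used here for illustration only.
CYCLE_MEAN = -0.0486
CYCLE_STD = 0.2860


def normalized_cycle_score(largest_cycle: int) -> float:
    """Normalized cycle term for a given largest-cycle size (hypothetical helper)."""
    # Only the excess over 6 atoms is penalized; rings of size <= 6 cost nothing.
    excess = max(largest_cycle - 6, 0)
    raw_score = -excess
    return (raw_score - CYCLE_MEAN) / CYCLE_STD


score_b = normalized_cycle_score(12)  # molecule b: largest basis cycle has 12 atoms
score_a = normalized_cycle_score(6)   # molecule a: basis of three 6-cycles

print(f"molecule b: {score_b:.2f}")
print(f"molecule a: {score_a:.2f}")
```

Under these assumptions the 12-cycle gives roughly -20.8 and the all-6-cycle basis gives roughly 0.17, which is consistent with the rounded values of -21 and 0.2 reported above.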
Maybe I am missing something, but it seems to me that the ground truth of the ZINC dataset does not follow the documentation. The implementation that was used adds a lot of noise and makes it harder to test whether models actually learn chemically relevant information.
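To make the noise argument concrete, here is a small dependency-free sketch. It does not call RDKit or networkx. Instead it builds two cycle bases by hand from two different spanning trees of the same graph, a hypothetical naphthalene-like skeleton of two fused 6-rings. All names (`EDGES`, `tree_path`, `fundamental_cycles`, `cycle_penalty`) are made up for illustration, and the penalty rule (largest basis cycle minus 6, floored at 0) is the one described above, not a confirmed copy of the original code.

```python
from collections import deque

# Two fused 6-rings sharing the edge (4, 5): 10 atoms, 11 bonds, cycle rank 2.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]


def tree_path(tree_edges, start, goal):
    """Return the unique path from start to goal inside a spanning tree."""
    adjacency = {}
    for u, v in tree_edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    # Breadth-first search, remembering each node's predecessor.
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    # Walk back from goal to start to recover the path.
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def fundamental_cycles(edges, removed_edges):
    """Cycle basis induced by the spanning tree `edges` minus `removed_edges`."""
    tree_edges = [e for e in edges if e not in removed_edges]
    # Each removed edge closes exactly one cycle with the tree path between its ends.
    return [tree_path(tree_edges, u, v) for u, v in removed_edges]


def cycle_penalty(cycles):
    """Largest cycle size minus 6, floored at 0 (rule as described in this post)."""
    largest = max((len(c) for c in cycles), default=0)
    return max(largest - 6, 0)


# Same graph, two different spanning trees, two different bases.
basis_small = fundamental_cycles(EDGES, [(5, 0), (9, 5)])
basis_large = fundamental_cycles(EDGES, [(4, 5), (5, 0)])

print(sorted(len(c) for c in basis_small), cycle_penalty(basis_small))
print(sorted(len(c) for c in basis_large), cycle_penalty(basis_large))
```

Both lists are valid cycle bases of the same graph, yet the first consists of two 6-cycles (penalty 0) while the second contains the 10-atom perimeter (penalty 4). Any routine that returns an arbitrary cycle basis makes one such choice internally. networkx also provides `minimum_cycle_basis`, which favors short cycles and might be closer to the documented intent, though whether it matches the ZINC ground truth is untested here.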
So, if I am not mistaken, it would probably be best to correct the documentation? Or provide a new dataset with the corrected cycle penalty?
Thanks in advance for any help with this!