Computation of penalized logP in the ZINC dataset. #9004

KemperNiklas · 2024-03-02T14:47:59Z

The pytorch_geometric documentation and many other sources say that the penalized logP (plogp) score in ZINC is computed as:
plogp = logP - SAS - cycles
where SAS is the synthetic accessibility score and cycles is a penalty for cycles with more than six atoms.

While I couldn't find the original source that computes plogp for the ZINC dataset, it seems like the following code was used:

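# Imports added so the snippet runs on its own; the original source may import
# these differently. sascorer ships in RDKit's Contrib/SA_Score directory and
# must be on the Python path.
import networkx as nx
from rdkit import Chem
from rdkit.Chem.Crippen import MolLogP
from sascorer import calculateScore

def reward_penalized_log_p(mol):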
    """
    Reward that consists of log p penalized by SA and # long cycles,
    as described in (Kusner et al. 2017). Scores are normalized based on the
    statistics of 250k_rndm_zinc_drugs_clean.smi dataset
    :param mol: rdkit mol object
    :return: float
    """
    # normalization constants, statistics from 250k_rndm_zinc_drugs_clean.smi
    logP_mean = 2.4570953396190123
    logP_std = 1.434324401111988
    SA_mean = -3.0525811293166134
    SA_std = 0.8335207024513095
    cycle_mean = -0.0485696876403053
    cycle_std = 0.2860212110245455

    log_p = MolLogP(mol)
    SA = -calculateScore(mol)

    # cycle score
    cycle_list = nx.cycle_basis(nx.Graph(
        Chem.rdmolops.GetAdjacencyMatrix(mol)))
    if len(cycle_list) == 0:
        cycle_length = 0
    else:
        cycle_length = max([len(j) for j in cycle_list])
    if cycle_length <= 6:
        cycle_length = 0
    else:
        cycle_length = cycle_length - 6
    cycle_score = -cycle_length

    normalized_log_p = (log_p - logP_mean) / logP_std
    normalized_SA = (SA - SA_mean) / SA_std
    normalized_cycle = (cycle_score - cycle_mean) / cycle_std

    return normalized_log_p + normalized_SA + normalized_cycle
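
Written out, and assuming the snippet above is read correctly, the function returns a sum of three z-scored terms rather than the raw difference in the documented formula. The symbols below are notation introduced here for illustration: SA(m) is the value of calculateScore(mol), the means and standard deviations are the constants hard-coded in the function, and l_max(m) is the length of the largest cycle in the basis returned by nx.cycle_basis (0 for an acyclic molecule).

```latex
\[
\mathrm{plogp}_{\mathrm{norm}}(m) =
\frac{\log P(m) - \mu_{\log P}}{\sigma_{\log P}}
+ \frac{-\mathrm{SA}(m) - \mu_{\mathrm{SA}}}{\sigma_{\mathrm{SA}}}
+ \frac{-\max\bigl(0,\ \ell_{\max}(m) - 6\bigr) - \mu_{\mathrm{cyc}}}{\sigma_{\mathrm{cyc}}}
\]
```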

At least, this gives results comparable to the ground truth in the PyG dataset.
It differs from the documented formula because the cycle penalty is computed from the largest cycle of a cycle basis, which leads to odd examples like the following (I colored the cycles of the cycle basis that are used in the reward_penalized_log_p function):

plogp.pdf

Molecule b is from the ZINC dataset. Its cycle penalty is computed from the 12-cycle, which results in a huge normalized penalty of about -21.
Molecule a is hand-crafted by me. Here, the cycle basis consists of just three 6-cycles, which results in a normalized cycle term of about 0.2, despite molecule a having a cyclic structure similar to molecule b. Which cycle basis is used depends on the implementation of the nx cycle_basis function.
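
The two numbers above can be reproduced from the constants in reward_penalized_log_p alone. The following is a minimal sketch; normalized_cycle_term is a hypothetical helper name, and it takes the largest basis-cycle length as input instead of computing it from a molecule.

```python
# Constants copied from reward_penalized_log_p.
CYCLE_MEAN = -0.0485696876403053
CYCLE_STD = 0.2860212110245455


def normalized_cycle_term(largest_cycle_len):
    """Normalized cycle term for a given largest basis-cycle length.

    Hypothetical helper: mirrors only the cycle part of
    reward_penalized_log_p, with the cycle length passed in directly.
    """
    # Only atoms beyond a six-membered ring are penalized.
    excess = max(0, largest_cycle_len - 6)
    cycle_score = -excess
    return (cycle_score - CYCLE_MEAN) / CYCLE_STD


# Acyclic, six-ring, seven-ring and twelve-ring cases.
for length in (0, 6, 7, 12):
    print(length, round(normalized_cycle_term(length), 2))
```

A largest cycle of 12 gives roughly -20.8, and any molecule whose largest basis cycle has at most six atoms gets the same small positive value of roughly 0.17, because the term is shifted by the dataset mean.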

Maybe I am missing something, but it seems to me that the ground truth of the ZINC dataset does not follow the documentation. The implementation that was used just adds a lot of noise and makes it harder to test whether models actually learn chemically relevant information.
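
To illustrate where the noise comes from: the largest cycle in a cycle basis is not a property of the graph alone, because one graph has many valid bases. The sketch below is an assumption-laden toy, not the networkx algorithm and not molecules a or b from the attachment. It builds two fundamental cycle bases for the naphthalene skeleton (two fused six-membered rings), one from a breadth-first spanning tree and one from a depth-first spanning tree, and compares the largest cycle in each. All function names are hypothetical.

```python
from collections import deque

# Toy graph: naphthalene skeleton, two six-membered rings sharing bond 4-9.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (6, 7), (7, 8), (8, 9), (9, 0), (4, 9)]


def build_adjacency(edges):
    """Undirected adjacency lists with sorted neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def spanning_tree(adj, root, depth_first):
    """Parent map of a spanning tree; stack order if depth_first, else queue order."""
    parent = {}
    frontier = deque([(root, None)])
    while frontier:
        node, par = frontier.pop() if depth_first else frontier.popleft()
        if node in parent:
            continue  # already reached through an earlier tree edge
        parent[node] = par
        for nbr in adj[node]:
            if nbr not in parent:
                frontier.append((nbr, node))
    return parent


def path_to_root(parent, node):
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def fundamental_cycle_lengths(adj, parent):
    """Atom count of the cycle closed by each non-tree edge."""
    lengths = []
    for u in adj:
        for v in adj[u]:
            if u < v and parent[u] != v and parent[v] != u:
                pu, pv = path_to_root(parent, u), path_to_root(parent, v)
                shared = len(set(pu) & set(pv))  # common ancestors, incl. the meeting node
                lengths.append(len(pu) + len(pv) - 2 * shared + 1)
    return sorted(lengths)


def largest_basis_cycle(adj, depth_first):
    parent = spanning_tree(adj, root=0, depth_first=depth_first)
    return max(fundamental_cycle_lengths(adj, parent))


ADJ = build_adjacency(EDGES)
for name, depth_first in (("breadth-first tree", False), ("depth-first tree", True)):
    longest = largest_basis_cycle(ADJ, depth_first)
    print(name, "-> largest basis cycle:", longest, "raw penalty:", max(0, longest - 6))
```

With the breadth-first tree both basis cycles are six-membered rings and the raw penalty is 0. With the depth-first tree one basis cycle is the 10-atom perimeter and the raw penalty is 4, although the graph is identical. Both bases are valid, so a "largest cycle in the basis" penalty depends on how the basis happens to be constructed.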

So, if I am not mistaken, it would probably be best to correct the documentation? Or provide a new dataset with the corrected cycle penalty?

Thanks in advance for any help with this!

rusty1s (Maintainer) · 2024-03-05T12:24:20Z

Thanks for this insightful issue. The description and the source of the dataset are taken from https://arxiv.org/pdf/2003.00982v5.pdf. How do you suggest that we update the documentation?
