Remove clearance_hepatocyte_az dataset with duplicate entries #524

Merged
j-adamczyk merged 1 commit into MLCIL:master from LiudengZhang:fix/remove-clearance-hepatocyte-az-dataset
Mar 10, 2026

@LiudengZhang
Contributor

Summary

Removes the clearance_hepatocyte_az dataset from the loader function, __init__.py exports, benchmark mappings, dataset name lists, tests, and Sphinx docs.

Closes #461

Test plan

  • Verify no remaining references to clearance_hepatocyte_az in codebase
  • Existing tests for other datasets still pass
  • load_tdc_benchmark and load_tdc_splits work without the removed dataset
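
The first check in the test plan could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the helper name `find_references` and the list of file suffixes are assumptions chosen for illustration, and a plain `grep -r` would do the same job.

```python
from pathlib import Path


def find_references(root, needle, suffixes=(".py", ".rst", ".md", ".txt")):
    """Return (path, line number, stripped line) for each line under root containing needle.

    Hypothetical helper: the suffix filter is an assumption, adjust it to the repo layout.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Only scan regular files with one of the chosen text suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            # Skip files that cannot be read as UTF-8 text.
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Run from the repository root, `find_references(".", "clearance_hepatocyte_az")` should return an empty list once the removal is complete.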

The Clearance AstraZeneca Hepatocyte dataset contains duplicated SMILES
with conflicting labels (1213 entries but only 1020 unique molecules),
making it unreliable for modeling. This is a known upstream TDC issue.

Closes MLCIL#461
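
To illustrate the kind of check behind the numbers in the commit message, here is a small sketch that counts entries, unique SMILES strings, and SMILES that appear with more than one label. The function name `summarize_duplicates` and the toy records are made up for illustration and are not taken from the dataset. It also compares SMILES as exact strings, which is an assumption: the reported count of unique molecules may have been obtained after canonicalization.

```python
from collections import defaultdict


def summarize_duplicates(records):
    """Summarize an iterable of (smiles, label) pairs.

    Hypothetical helper: SMILES are compared as raw strings, without canonicalization.
    """
    labels_by_smiles = defaultdict(set)
    n_entries = 0
    for smiles, label in records:
        labels_by_smiles[smiles].add(label)
        n_entries += 1
    # A SMILES is conflicting if it was seen with at least two distinct labels.
    conflicting = sorted(s for s, labels in labels_by_smiles.items() if len(labels) > 1)
    return {
        "entries": n_entries,
        "unique": len(labels_by_smiles),
        "conflicting": conflicting,
    }


# Toy records (invented): "CCO" repeats with different labels, "CCN" repeats with the same label.
toy = [("CCO", 1.0), ("CCO", 2.5), ("c1ccccc1", 0.3), ("CCN", 4.0), ("CCN", 4.0)]
summary = summarize_duplicates(toy)
print(summary)
```

On the toy data this reports 5 entries, 3 unique SMILES, and one conflicting SMILES (`CCO`). Exact duplicates with identical labels, like `CCN` here, are counted as repeated entries but not as conflicts.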
@j-adamczyk merged commit 70bcf92 into MLCIL:master on Mar 10, 2026
13 checks passed

Linked issue: Duplicated data in Clearance AstraZeneca Hepatocyte