Conversation

lilyminium (Contributor) commented Apr 22, 2025

Just opening as a draft PR to show some prototype code -- would it be possible to speed up the InChI mapping at all? I put a quick progress bar in the original method and it can take anywhere between 2 and 5 hours on the filtered Industry Benchmark Set, which is quite a while. If InChI keys map one-to-one to molecule IDs, as the original implementation implies, could we just pull everything out at once and group by molecule ID?
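
Roughly what I have in mind, as a sketch with placeholder names (all_rows and "molecule_id" are stand-ins, not the actual store API): pull every record in one query, then group in memory instead of issuing one lookup per InChI key.

from collections import defaultdict

def group_by_molecule_id(all_rows):
    """Group pre-fetched rows by molecule ID in a single pass."""
    grouped = defaultdict(list)
    for row in all_rows:
        # each row is assumed to be a mapping with a "molecule_id" key
        grouped[row["molecule_id"]].append(row)
    return grouped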

Edit: I haven't tested this at all, just spitballing some ideas

I added a small regression test, and the optimize_mm test will also call this method. The new approach is near-instant on my laptop vs. 1-2 hours for the older method (I suspect file I/O issues on HPC3 are behind the 5+ hours remotely).

codecov-commenter commented Apr 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.94%. Comparing base (46d7229) to head (9530202).
Report is 8 commits behind head on main.


@lilyminium lilyminium marked this pull request as ready for review April 22, 2025 21:36
@lilyminium lilyminium marked this pull request as draft April 23, 2025 00:46
lilyminium (Contributor, Author) commented Apr 23, 2025

Never mind -- this new method finds 72105 records, while the older one finds 72150. Not sure what's going wrong here.

mattwthompson (Member) left a comment

I think this is great:

  • the code change is easy to follow once I read through it and remember what it aims to do
  • the performance uplift is not surprising
  • I can see an improvement locally

After spending all too much time trying to get the logger to include timestamps, I just threw a hacky patch onto this branch and main and ran the dead-simple run.py script that's been chilling in the root of this repo for a while.
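
(For reference, something along these lines gets timestamps into the log output -- a sketch using Python's standard logging module, not the actual patch:)

import logging

# Force timestamps onto the root logger so each step's duration can be read
# straight off the log output.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s.%(msecs)03d %(levelname)s %(name)s: %(message)s",
    datefmt="%H:%M:%S",
    force=True,  # override any previously configured handlers (Python 3.8+)
)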

This brings that step from 350 ms down to 70 ms, albeit on a smaller dataset of 1611 conformers. I don't know what the speedup factor is on larger systems, since I haven't tested it on systems of the size you care about. At the very least I doubt this is a slowdown, if I correctly understand how it's accessing the database.

If this ends up being a standalone improvement, awesome; we should get this through. If you want it as part of some larger changes, or we need to do more exhaustive profiling, let's hold off.

lilyminium (Contributor, Author) commented Apr 23, 2025

I think there are problems with this PR -- I'm interpreting the drop in the number of records as meaning that there are multiple molecule IDs for the same InChI key, and that lines 552-555 are just overwriting some of those InChI keys. Looking at the code, this dictionary actually gets flattened out again (below), so IIUC that means there are actually 45 records that don't get optimized now.

inputs = [
    MinimizationInput(
        inchi_key=inchi_key,
        qcarchive_id=row["qcarchive_id"],
        force_field=force_field,
        mapped_smiles=row["mapped_smiles"],
        coordinates=row["coordinates"],
    )
    for inchi_key in input
    for row in input[inchi_key]
]

I think this makes sense, since mapped_smiles appears to be what determines distinct molecule IDs, and multiple mapped_smiles can map onto the same InChI. That said, I'm not sure exactly what purpose the InChI mapping actually serves, and I wonder if it could even be refactored out.
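
To illustrate the difference with placeholder data (not the real code): plain dict assignment keeps only the last molecule seen per InChI key, while accumulating into a list keeps all of them and flattens back out to the full set of records.

from collections import defaultdict

# Two distinct molecule IDs sharing one InChI key (hypothetical rows).
rows = [
    {"inchi_key": "ABC", "molecule_id": 1},
    {"inchi_key": "ABC", "molecule_id": 2},
]

# Overwriting, which is what I suspect is happening: only the last row survives.
by_inchi = {}
for row in rows:
    by_inchi[row["inchi_key"]] = row

# Accumulating: both rows are kept per key.
grouped = defaultdict(list)
for row in rows:
    grouped[row["inchi_key"]].append(row)

assert len(by_inchi) == 1
assert sum(len(v) for v in grouped.values()) == 2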

Edit: (so TL;DR -- yes let's hold off, thanks so much for the quick review in your evening!)
