Rewrite mapping InChI keys? #135
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Never mind -- this new method finds 72105 records, the older one 72150. Not sure what's going wrong here.
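A quick way to track down the missing 45 would be a set difference over the two outputs. A minimal sketch, where `old_mapping` and `new_mapping` stand in for the results of the original and new implementations (both assumed here to map InChI keys to lists of molecule IDs):

```python
# Flatten each mapping (InChI key -> list of molecule IDs) into a set of IDs,
# then diff the sets to see exactly which records the new method drops.
old_ids = {mol_id for ids in old_mapping.values() for mol_id in ids}
new_ids = {mol_id for ids in new_mapping.values() for mol_id in ids}

missing = old_ids - new_ids
print(f"{len(missing)} molecule IDs in the old mapping but not the new one")
```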
mattwthompson left a comment
I think this is great:
- the code change is easy to follow once I read through it and remember what it aims to do
- the performance uplift is not surprising
- I can see an improvement locally
After spending all too much time trying to get the logger to include timestamps, I just threw a hacky patch onto this branch and main and ran the dead-simple run.py script that's been chilling in the root of this repo for a while.
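(The patch itself isn't shown in this thread; for reference, forcing timestamps onto Python's standard logging usually comes down to something like this minimal sketch:)

```python
import logging

# Force a timestamped format onto the root logger so existing log calls
# pick it up; force=True replaces any handlers configured earlier.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
    force=True,  # requires Python 3.8+
)
```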
This brings that step from 350 ms down to 70 ms, albeit on a smaller dataset of 1611 conformers. I don't know what the speedup factor is on larger systems, since I haven't tested it on systems of the size you care about. At the very least I doubt this is a slowdown, if I correctly understand how it's accessing the database.
If this ends up being a standalone improvement - awesome, we should get this through. If you want this as part of some larger changes, or we need to do some more exhaustive profiling, let's hold off?
I think there are problems in this PR -- I'm interpreting the drop in the number of records as reflecting that there are multiple molecule IDs for the same InChI key, and that lines 552-555 are just overwriting some of those InChI keys. Looking at the code, this dictionary actually gets flattened out again (below), so IIUC that means there are actually 45 records that don't get optimized now.

Lines 71 to 81 in 4af2518
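To make the suspected failure mode concrete (hypothetical data, not the actual schema): when two molecule IDs share an InChI key, a plain dict comprehension keeps only the last one, while grouping into lists preserves both:

```python
from collections import defaultdict

# Hypothetical (inchi_key, molecule_id) pairs pulled from the database;
# two molecule IDs share the same InChI key here.
rows = [("KEY-A", 1), ("KEY-A", 2), ("KEY-B", 3)]

# A plain dict comprehension silently drops molecule ID 1 ...
flat = {inchi: mol_id for inchi, mol_id in rows}
assert flat == {"KEY-A": 2, "KEY-B": 3}

# ... whereas grouping keeps every molecule ID per InChI key.
grouped = defaultdict(list)
for inchi, mol_id in rows:
    grouped[inchi].append(mol_id)
assert dict(grouped) == {"KEY-A": [1, 2], "KEY-B": [3]}
```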
I think this makes sense, since I think mapped_smiles are used to determine different molecule IDs, and multiple mapped_smiles can map onto the same InChI. That being said, I'm not sure exactly what purpose the InChI mapping actually serves, and I wonder if it could even be refactored out.

Edit: (so TL;DR -- yes, let's hold off; thanks so much for the quick review in your evening!)
Just opening as a draft PR to show some prototype code -- would it be possible to speed up the InChI mapping at all? I put a quick progress bar in the original method and it can take anywhere between 2 and 5 hours on the filtered Industry Benchmark Set, which is quite a while. If InChI is a one-to-one mapping to molecule ID, as implied by the original implementation, could we just pull everything out at once and group by molecule ID? I haven't tested this at all, just spitballing some ideas.

Edit: I added a small regression test, and the optimize_mm test will also call this method. This is near-instant on my laptop vs 1-2 hours for the older method (I suspect file I/O issues on HPC3 are causing the 5+ hours remotely).
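For what it's worth, the "pull everything out at once" idea might look roughly like this in SQLAlchemy 2.0 style; MoleculeRecord and its columns are hypothetical stand-ins, not this repo's actual schema:

```python
from collections import defaultdict

from sqlalchemy import Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Session, mapped_column


class Base(DeclarativeBase):
    pass


# Hypothetical ORM model standing in for however molecules are stored here.
class MoleculeRecord(Base):
    __tablename__ = "molecules"

    id = mapped_column(Integer, primary_key=True)
    inchi_key = mapped_column(String)


def map_inchi_keys(session: Session) -> dict[str, list[int]]:
    """Fetch every (id, inchi_key) pair in a single query and group in
    memory, rather than issuing one query per molecule."""
    inchi_to_ids = defaultdict(list)
    for molecule_id, inchi_key in session.execute(
        select(MoleculeRecord.id, MoleculeRecord.inchi_key)
    ):
        inchi_to_ids[inchi_key].append(molecule_id)
    return dict(inchi_to_ids)
```

Grouping into lists (rather than a flat inchi -> id dict) also sidesteps the overwriting problem discussed above.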