Rewrite mapping InChI keys? #135
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Never mind -- this new method finds 72105 records, the older one 72150. Not sure what's going wrong here.
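A quick way to track down the missing 45 would be a set difference over the two outputs. A minimal sketch, where `old_mapping` and `new_mapping` stand in for the results of the original and new implementations (both assumed here to map InChI keys to lists of molecule IDs):

```python
# Flatten each mapping (InChI key -> list of molecule IDs) into a set of IDs,
# then diff the sets to see exactly which records the new method drops.
old_ids = {mol_id for ids in old_mapping.values() for mol_id in ids}
new_ids = {mol_id for ids in new_mapping.values() for mol_id in ids}

missing = old_ids - new_ids
print(f"{len(missing)} molecule IDs in the old mapping but not the new one")
```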
mattwthompson left a comment
I think this is great:
- the code change is easy to follow once I read through it and remember what it aims to do
- the performance uplift is not surprising
- I can see an improvement locally
After spending all too much time trying to get the logger to include timestamps, I just threw a hacky patch onto this branch and main and ran the dead-simple run.py script that's been chilling in the root of this repo for a while.
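(The patch itself isn't shown in this thread; for reference, forcing timestamps onto Python's standard logging usually comes down to something like this minimal sketch:)

```python
import logging

# Force a timestamped format onto the root logger so existing log calls
# pick it up; force=True replaces any handlers configured earlier.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
    force=True,  # requires Python 3.8+
)
```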
This brings that step from 350 ms down to 70 ms, albeit on a smaller dataset of 1611 conformers. I don't know what the speedup factor is on larger systems, since I haven't tested it on systems of the size you care about. At the very least I doubt this is a slowdown, if I correctly understand how it's accessing the database.
If this ends up being a standalone improvement - awesome, we should get this through. If you want this as part of some larger changes, or we need to do some more exhaustive profiling, let's hold off?
I think there are problems in this PR -- I'm interpreting the drop in the number of records as reflecting that there are multiple molecule IDs for the same InChI key, and that lines 552-555 are just overwriting some of those InChI keys. Looking at the code, this dictionary actually gets flattened out again (below), so IIUC that means there are actually 45 records that don't get optimized now.

Lines 71 to 81 in 4af2518
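To make the suspected failure mode concrete (hypothetical data, not the actual schema): when two molecule IDs share an InChI key, a plain dict comprehension keeps only the last one, while grouping into lists preserves both:

```python
from collections import defaultdict

# Hypothetical (inchi_key, molecule_id) pairs pulled from the database;
# two molecule IDs share the same InChI key here.
rows = [("KEY-A", 1), ("KEY-A", 2), ("KEY-B", 3)]

# A plain dict comprehension silently drops molecule ID 1 ...
flat = {inchi: mol_id for inchi, mol_id in rows}
assert flat == {"KEY-A": 2, "KEY-B": 3}

# ... whereas grouping keeps every molecule ID per InChI key.
grouped = defaultdict(list)
for inchi, mol_id in rows:
    grouped[inchi].append(mol_id)
assert dict(grouped) == {"KEY-A": [1, 2], "KEY-B": [3]}
```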
I think this makes sense, since I think mapped_smiles are used to determine different molecule IDs, and multiple mapped_smiles can map onto the same InChI. That being said, I'm not sure exactly what purpose the InChI mapping actually serves, and I wonder if it could even be refactored out.

Edit: (so TL;DR -- yes, let's hold off; thanks so much for the quick review in your evening!)
Just opening as a draft PR to show some prototype code -- would it be possible to speed up the InChI mapping at all? I put a quick progress bar in the original method and it can take anywhere between 2 and 5 hours on the filtered Industry Benchmark Set, which is quite a while. If InChI is a one-to-one mapping to molecule ID, as implied by the original implementation, could we just pull everything out at once and group by molecule ID? I haven't tested this at all, just spitballing some ideas.

Edit: I added a small regression test, and the optimize_mm test will also call this method. This is near-instant on my laptop vs 1-2 hours for the older method (I suspect file I/O issues on HPC3 are causing the 5+ hours remotely).
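For what it's worth, the "pull everything out at once" idea might look roughly like this in SQLAlchemy 2.0 style; MoleculeRecord and its columns are hypothetical stand-ins, not this repo's actual schema:

```python
from collections import defaultdict

from sqlalchemy import Integer, String, select
from sqlalchemy.orm import DeclarativeBase, Session, mapped_column


class Base(DeclarativeBase):
    pass


# Hypothetical ORM model standing in for however molecules are stored here.
class MoleculeRecord(Base):
    __tablename__ = "molecules"

    id = mapped_column(Integer, primary_key=True)
    inchi_key = mapped_column(String)


def map_inchi_keys(session: Session) -> dict[str, list[int]]:
    """Fetch every (id, inchi_key) pair in a single query and group in
    memory, rather than issuing one query per molecule."""
    inchi_to_ids = defaultdict(list)
    for molecule_id, inchi_key in session.execute(
        select(MoleculeRecord.id, MoleculeRecord.inchi_key)
    ):
        inchi_to_ids[inchi_key].append(molecule_id)
    return dict(inchi_to_ids)
```

Grouping into lists (rather than a flat inchi -> id dict) also sidesteps the overwriting problem discussed above.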