refactor(binding): make DerefMap computation lazy and support multiple value inputs #11540
Conversation
some minor comments/thoughts. Thanks for progressing this :)
```python
for rel in self.rels:
    for field in rel.fields.values():
        for val, distance in self.__class__.backtrack(field):
```
in a separate PR I think a useful performance improvement (for long chains of expressions) would be to cache each "level" of info extracted from backtracking on a "per relation" basis
cached info would have to be held at relation level, or maybe via weakref/finalizer, so that it gets GC'd when associated relations are deleted
It's not 100% clear to me from your description what kind of performance improvement you might expect here.
Thinking it through out loud, it seems like the improvement scales with the number of relations in a chain, or, in terms of implementation, with the number of `DerefMap`s constructed. I think this is in line with your supposition about long chains of expressions.
I think the complexity is in figuring out who owns the cache (another thing you're alluding to!).
What if, instead of adding caching to `DerefMap`, we give every instance of `Table` or the underlying operation a lazily constructed deref map?
This would have the effect of tying the deref map to the object, effectively caching it for the instance, and we don't have to add complexity to `DerefMap` to make it work.
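The per-instance laziness suggested above could be sketched with `functools.cached_property`: the map is built on first access and then reused for the lifetime of the instance. This is a minimal illustration with hypothetical names (`Table`, `deref_map`), not ibis's actual API.

```python
# Sketch: tie a lazily built deref map to the table-like object itself,
# so it is computed at most once per instance (hypothetical names).
from functools import cached_property


class Table:
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # e.g. {"a": "t.a"}: field name -> bound expression

    @cached_property
    def deref_map(self):
        # Expensive construction runs only on the first access;
        # subsequent accesses return the cached dict from __dict__.
        return {expr: name for name, expr in self.fields.items()}


t = Table("t", {"a": "t.a", "b": "t.b"})
assert t.deref_map is t.deref_map  # built once, then cached on the instance
```

Because `cached_property` stores the result in the instance's `__dict__`, the cache is garbage-collected together with the table, sidestepping the weakref/finalizer bookkeeping discussed earlier.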
Thanks, and sorry for not being very clear. Just to clarify:
- at the moment a `DerefMap` is constructed for every call to `.bind` on a table
- this has a performance impact if:
  - a table is used as the base for many queries
  - or a chain of operations is constructed (a map from ancestors could possibly be used to build a descendant's map)

I agree that the map could be held as an attribute at table level, and that might be better than having `DerefMap` manage a set of cached maps via weakrefs etc.
Another thing to note: for chains of operations, deref maps grow in size in a way that might make naive caching a poor solution. Total memory requirements for a length-D chain of operations with N fields is O(N*D^2), because every additional operation in the chain includes all deref-map items from the level above.
Another approach would be to cache only one layer's depth of derefs on each table, and then do something like:

```python
while expr.table != self:  # assumes expressions have some concept of the table they are bound to
    expr = self._deref_maps_by_src_table[expr.table][expr]
```

This would have total memory requirements of O(N*D) instead of O(N*D^2). Deref maps in deep chains would be fast to construct, but fields from relations far up the chain would take longer to dereference (my assumption was that users are unlikely to use distant fields, but I have no evidence).
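The one-layer scheme can be made concrete with a toy model (all names here are hypothetical, not ibis internals): each table stores only the derefs it introduces relative to its parent, and dereferencing an ancestor's expression walks those layers down the chain.

```python
# Toy sketch of per-table single-layer deref caching: memory is O(N) per
# level (O(N*D) total), at the cost of an O(D) walk for distant fields.
class Expr:
    def __init__(self, table, name):
        self.table, self.name = table, name


class Table:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # One layer only: maps the parent's expressions to this table's.
        self.layer = {}

    def deref(self, expr):
        # Collect the chain of tables from self up to the expression's owner.
        chain, t = [], self
        while t is not expr.table:
            chain.append(t)
            t = t.parent
            if t is None:
                raise KeyError(f"{expr.name} is not bound to an ancestor")
        # Rewrite the expression one layer at a time, top of chain downward.
        for t in reversed(chain):
            expr = t.layer[expr]
        return expr


base = Table("base")
mid = Table("mid", parent=base)
top = Table("top", parent=mid)

a0, a1, a2 = Expr(base, "a"), Expr(mid, "a"), Expr(top, "a")
mid.layer[a0] = a1  # base's `a` seen from mid
top.layer[a1] = a2  # mid's `a` seen from top

assert top.deref(a0) is a2  # two-layer walk
assert top.deref(a2) is a2  # already bound to top: no walk
```

Each table holds N entries regardless of chain depth, which is the O(N*D) total mentioned above; the trade-off is the per-lookup walk for expressions bound far up the chain.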
There was a little more explanation in the previous PR, so I'll link to the relevant comment rather than duplicating the whole thing: #11458 (comment)
Naive caching could be done as a starting point, and that's a decision I'm happy to leave with the ibis team :) I guess for a 10-item chain over a 100-column table it's still only ~10,000 dict items, but if people are doing much bigger things in practice it could be an issue.
I think a `collections.ChainMap` might serve this purpose well, with each new child map containing only the current relation's new fields.
I believe then (as you say) that there's an additional O(D) operations for a lookup in the worst case (when a lookup is in the first parent).
I don't know whether the "chaining radius" is usually small, but I would guess that it is, as it seems difficult to reason about things the bigger the radius.
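The `ChainMap` idea above might look roughly like this (the field data is made up for illustration): each level contributes a small child map, and lookups fall through ancestor maps in order, which is where the worst-case O(D) comes from.

```python
# Sketch of per-relation child maps via collections.ChainMap:
# each level adds only its own new fields; misses fall through to ancestors.
from collections import ChainMap

base_fields = {"a": "base.a", "b": "base.b"}
derived = ChainMap({"c": "derived.c"}, base_fields)  # adds one field
further = derived.new_child({"d": "further.d"})      # one more level

assert further["d"] == "further.d"  # hit in the newest map: one probe
assert further["a"] == "base.a"     # falls through every map: O(D) probes
```

Memory stays proportional to the fields actually introduced at each level, rather than each level copying everything above it.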
Great, thanks for the reply. I think it might be worth looking into at a later stage. I've been testing main with typical queries we run and have seen a good reduction (>50%) in the time taken to build expressions. Most queries benefit from the "lazy derefmap" changes and are now not affected by `DerefMap` at all; the main bottlenecks are elsewhere within ibis, so I'll likely look at those next when it bubbles to the top of my list.
Sweet! Please make issues for the performance problems you encounter as they arise!