Skip to content

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Aug 7, 2025

Refactor DerefMap traversal to be lazy

@cpcloud cpcloud requested a review from kszucs August 7, 2025 16:22
@github-actions github-actions bot added the tests Issues or PRs related to tests label Aug 7, 2025
Copy link
Contributor

@hottwaj hottwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments/thoughts. Thanks for progressing this :)


for rel in self.rels:
for field in rel.fields.values():
for val, distance in self.__class__.backtrack(field):
Copy link
Contributor

@hottwaj hottwaj Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a separate PR I think a useful performance improvement (for long chains of expressions) would be to cache each "level" of info extracted from backtracking on a "per relation" basis
cached info would have to be held at relation level, or maybe via weakref/finalizer, so that it gets GC'd when associated relations are deleted

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not 100% clear to me from your description what kind performance improvement you might expect here.

Thinking it through out loud, it seems like the improvement scales with the number of relations in a chain, or in terms of implementation with the number of DerefMaps constructed. I think this is in line with your supposition about long chains of expressions.

I think that the complexity is in figuring out who owns the cache (another thing you're alluding to!).

What if instead of adding caching to DerefMap, we give every instance of Table or the underlying operation a lazily constructed deref map.

This would then have the effect of tying the deref map to the object, effectively caching it for the instance, and we don't have to add complexity to DerefMap to make it work.

Copy link
Contributor

@hottwaj hottwaj Aug 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks sorry for not being very clear, just to clarify:

  • at the moment a derefmap is constructed for every call to .bind on a table
  • this has a performance impact if:
    • a table is used as the base for many queries
    • or a chain of operations is constructed (map from ancestors could possibly be used to build descendent's map)

I agree that the map could be held as an attribute at table level and that might be better than having derefmap manage a set of cached maps via weakref etc

another thing to note is that for chains of operations derefmaps grow in size in a way that might not make naive caching a good solution. total memory requirements for a length D chain of operations with N fields is O(N*D^2), because every additional operation in the chain includes all derefmap items from the level above.

Another approach would be to only cache 1 layer depth of derefs on each table, and then do something like:

while expr.table != self:  # assumes expressions have some concept of the table they are bound to
    expr = self._deref_maps_by_src_table[expr.table][expr]

This would have total memory requirements of O(N*D) (instead of D^2). Deref maps in deep chains would be fast to construct, but fields from relations far up the chain would take longer to dereference (my assumption was that users are unlikely to use distant fields but I have no evidence)

There was a little more explanation in the previous PR so will link to relevant rather than duplicating the whole thing: #11458 (comment)

naive caching could be done as a start point and that's a decision I'm happy to leave with ibis team :) I guess for a 10 item chain of 100 column table its still just ~10000 dict items. but if people are doing much bigger things in practice it could be an issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a collections.ChainMap might serve this purpose well, with each new child map being only the current relation's new fields.

I believe then (as you say) that there's now an additional O(D) number of operations for a lookup in the worst case (when a look up is in the first parent).

I don't know whether the "chaining radius" is usually small, but I would guess that it is, as it seems difficult to reason about things the bigger the radius.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great thanks for the reply. I think it might be worth looking into at a later stage, I've been testing main with typical queries we run and have seen a good reduction (>50%) in time taken to build expressions. Most queries benefit from the "lazy derefmap" changes and so are now not affected by derefmap at all, and main bottlenecks are elsewhere within ibis, so I'll likely look at those next when that bubbles to top of my list

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet! Please make issues for the performance problems you encounter as they arise!

@cpcloud cpcloud force-pushed the refactor-deref-map-to-be-lazy branch 2 times, most recently from ed27d33 to 7061eee Compare August 8, 2025 13:54
@cpcloud cpcloud changed the title refactor deref map to be lazy refactor(binding): make DerefMap computation lazy and support multiple value inputs Aug 8, 2025
@cpcloud cpcloud force-pushed the refactor-deref-map-to-be-lazy branch 2 times, most recently from 0fa380e to 4b2fa97 Compare August 19, 2025 13:29
@cpcloud cpcloud force-pushed the refactor-deref-map-to-be-lazy branch from 4b2fa97 to 24118ae Compare August 19, 2025 16:42
@cpcloud cpcloud marked this pull request as ready for review August 19, 2025 16:43
@cpcloud cpcloud merged commit 5757caa into ibis-project:main Aug 21, 2025
146 of 147 checks passed
@cpcloud cpcloud deleted the refactor-deref-map-to-be-lazy branch August 21, 2025 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests Issues or PRs related to tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants