refactor(binding): make DerefMap computation lazy and support multiple value inputs #11540
Conversation
some minor comments/thoughts. Thanks for progressing this :)
```python
for rel in self.rels:
    for field in rel.fields.values():
        for val, distance in self.__class__.backtrack(field):
```
in a separate PR I think a useful performance improvement (for long chains of expressions) would be to cache each "level" of info extracted from backtracking on a "per relation" basis
cached info would have to be held at relation level, or maybe via weakref/finalizer, so that it gets GC'd when associated relations are deleted
It's not 100% clear to me from your description what kind of performance improvement you might expect here.
Thinking it through out loud, it seems like the improvement scales with the number of relations in a chain, or, in terms of implementation, with the number of `DerefMap`s constructed. I think this is in line with your supposition about long chains of expressions.
I think the complexity is in figuring out who owns the cache (another thing you're alluding to!).
What if, instead of adding caching to `DerefMap`, we give every instance of `Table` or the underlying operation a lazily constructed deref map?
This would have the effect of tying the deref map to the object, effectively caching it for the instance, and we don't have to add complexity to `DerefMap` to make it work.
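The per-instance laziness suggested above could be sketched with `functools.cached_property`: the map is built on first access and then reused for the lifetime of the instance. This is a minimal illustration with hypothetical names (`Table`, `deref_map`), not ibis's actual API.

```python
# Sketch: tie a lazily built deref map to the table-like object itself,
# so it is computed at most once per instance (hypothetical names).
from functools import cached_property


class Table:
    def __init__(self, name, fields):
        self.name = name
        self.fields = fields  # e.g. {"a": "t.a"}: field name -> bound expression

    @cached_property
    def deref_map(self):
        # Expensive construction runs only on the first access;
        # subsequent accesses return the cached dict from __dict__.
        return {expr: name for name, expr in self.fields.items()}


t = Table("t", {"a": "t.a", "b": "t.b"})
assert t.deref_map is t.deref_map  # built once, then cached on the instance
```

Because `cached_property` stores the result in the instance's `__dict__`, the cache is garbage-collected together with the table, sidestepping the weakref/finalizer bookkeeping discussed earlier.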
Thanks, and sorry for not being very clear. Just to clarify:
- at the moment a `DerefMap` is constructed for every call to `.bind` on a table
- this has a performance impact if:
  - a table is used as the base for many queries
  - or a chain of operations is constructed (a map from ancestors could possibly be used to build a descendant's map)

I agree that the map could be held as an attribute at table level, and that might be better than having `DerefMap` manage a set of cached maps via weakrefs etc.
Another thing to note: for chains of operations, deref maps grow in size in a way that might make naive caching a poor solution. Total memory requirements for a length-D chain of operations with N fields is O(N*D^2), because every additional operation in the chain includes all deref-map items from the level above.
Another approach would be to cache only one layer's depth of derefs on each table, and then do something like:

```python
while expr.table != self:  # assumes expressions have some concept of the table they are bound to
    expr = self._deref_maps_by_src_table[expr.table][expr]
```

This would have total memory requirements of O(N*D) instead of O(N*D^2). Deref maps in deep chains would be fast to construct, but fields from relations far up the chain would take longer to dereference (my assumption was that users are unlikely to use distant fields, but I have no evidence).
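The one-layer scheme can be made concrete with a toy model (all names here are hypothetical, not ibis internals): each table stores only the derefs it introduces relative to its parent, and dereferencing an ancestor's expression walks those layers down the chain.

```python
# Toy sketch of per-table single-layer deref caching: memory is O(N) per
# level (O(N*D) total), at the cost of an O(D) walk for distant fields.
class Expr:
    def __init__(self, table, name):
        self.table, self.name = table, name


class Table:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # One layer only: maps the parent's expressions to this table's.
        self.layer = {}

    def deref(self, expr):
        # Collect the chain of tables from self up to the expression's owner.
        chain, t = [], self
        while t is not expr.table:
            chain.append(t)
            t = t.parent
            if t is None:
                raise KeyError(f"{expr.name} is not bound to an ancestor")
        # Rewrite the expression one layer at a time, top of chain downward.
        for t in reversed(chain):
            expr = t.layer[expr]
        return expr


base = Table("base")
mid = Table("mid", parent=base)
top = Table("top", parent=mid)

a0, a1, a2 = Expr(base, "a"), Expr(mid, "a"), Expr(top, "a")
mid.layer[a0] = a1  # base's `a` seen from mid
top.layer[a1] = a2  # mid's `a` seen from top

assert top.deref(a0) is a2  # two-layer walk
assert top.deref(a2) is a2  # already bound to top: no walk
```

Each table holds N entries regardless of chain depth, which is the O(N*D) total mentioned above; the trade-off is the per-lookup walk for expressions bound far up the chain.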
There was a little more explanation in the previous PR, so I'll link to the relevant comment rather than duplicating the whole thing: #11458 (comment)
Naive caching could be done as a starting point, and that's a decision I'm happy to leave with the ibis team :) I guess for a 10-item chain over a 100-column table it's still only ~10,000 dict items, but if people are doing much bigger things in practice it could be an issue.
I think a `collections.ChainMap` might serve this purpose well, with each new child map containing only the current relation's new fields.
I believe then (as you say) that there's an additional O(D) operations for a lookup in the worst case (when a lookup is in the first parent).
I don't know whether the "chaining radius" is usually small, but I would guess that it is, as it seems difficult to reason about things the bigger the radius.
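The `ChainMap` idea above might look roughly like this (the field data is made up for illustration): each level contributes a small child map, and lookups fall through ancestor maps in order, which is where the worst-case O(D) comes from.

```python
# Sketch of per-relation child maps via collections.ChainMap:
# each level adds only its own new fields; misses fall through to ancestors.
from collections import ChainMap

base_fields = {"a": "base.a", "b": "base.b"}
derived = ChainMap({"c": "derived.c"}, base_fields)  # adds one field
further = derived.new_child({"d": "further.d"})      # one more level

assert further["d"] == "further.d"  # hit in the newest map: one probe
assert further["a"] == "base.a"     # falls through every map: O(D) probes
```

Memory stays proportional to the fields actually introduced at each level, rather than each level copying everything above it.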
Great, thanks for the reply. I think it might be worth looking into at a later stage. I've been testing main with typical queries we run and have seen a good reduction (>50%) in the time taken to build expressions. Most queries benefit from the "lazy derefmap" changes and are now not affected by `DerefMap` at all; the main bottlenecks are elsewhere within ibis, so I'll likely look at those next when it bubbles to the top of my list.
Sweet! Please make issues for the performance problems you encounter as they arise!