Replies: 1 comment 1 reply
-
Adding some details: this is with DataFusion versions 48 and 49. Also, for a join of this size (roughly 64 rows?) I would not expect more than one scan of each input table. The custom TableProviders fully honor the column projection for scan requests, but are otherwise quite simple internally. They return complete single record batches of roughly 64 rows, so I'm mystified as to how the join produces streams of tiny one- and two-row record batches as output.
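For context, the projection handling described above can be sketched in plain Rust. This is a simplified illustration, not the real `TableProvider::scan` signature: columns are modeled as `Vec<i64>` instead of Arrow arrays, and `SimpleTable` is a hypothetical name.

```rust
// Illustrative sketch only: a scan that honors a column projection,
// returning just the requested columns. Real providers build Arrow
// RecordBatches; here a "batch" is simply a Vec of columns.

struct SimpleTable {
    columns: Vec<Vec<i64>>, // column-oriented storage: columns[i] is column i
}

impl SimpleTable {
    // `projection` mirrors the optional column-index list DataFusion passes
    // to a scan: None means "all columns".
    fn scan(&self, projection: Option<&[usize]>) -> Vec<Vec<i64>> {
        match projection {
            Some(indices) => indices.iter().map(|&i| self.columns[i].clone()).collect(),
            None => self.columns.clone(),
        }
    }
}

fn main() {
    let table = SimpleTable {
        columns: vec![vec![1, 2, 3], vec![10, 20, 30], vec![100, 200, 300]],
    };
    // Project only columns 0 and 2, as a planner would for SELECT a, c.
    let batch = table.scan(Some(&[0, 2]));
    assert_eq!(batch, vec![vec![1, 2, 3], vec![100, 200, 300]]);
    println!("{} projected column(s)", batch.len()); // prints "2 projected column(s)"
}
```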
-
I'm assessing whether we can use DataFusion for our application, which falls into the "Custom Database" use-case category. We will be creating many custom TableProviders for this work. (Thanks for the brilliantly flexible ways DataFusion can be extended!)
My problem is that when I create joins over our custom TableProviders, I hit crippling performance problems that would essentially rule DataFusion out for our use case. From printing the results of running joins, what happens is that an Inner join (and every other join type I have tested: RightSemi, LeftSemi, etc.) returns a large number (18 or more) of tiny record batches, often just one row each, when pulling from our custom TableProviders. All the extra per-batch overhead and allocation is killing performance.
I have written essentially the same test using the built-in MemTable provider, and those tests run at least 10x faster for the same data sizes (in our case small batches of fewer than 64 rows at the moment) and essentially the same data layouts. I have also compared the constraints: each of our tables has a primary index on column 0 in all cases. When I run the same join logical-plan structure in the MemTable test, it returns single RecordBatches with a reasonable number of rows for this small data set. The same join (as far as I can tell) over our custom TableProviders instead returns a pile of roughly 20 record batches of one, or occasionally two, rows each.
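The batch fragmentation described above is exactly what batch coalescing is meant to fix; DataFusion has a `CoalesceBatchesExec` operator and a configurable execution batch size for this purpose, so checking whether one appears in the physical plan (via `EXPLAIN`) may be worthwhile. A minimal sketch of the coalescing idea in plain Rust, with batches simplified to `Vec<i64>` rows instead of Arrow RecordBatches and a hypothetical `coalesce` helper:

```rust
// Illustrative sketch only: merge a stream of tiny batches into batches of
// at least `target` rows. Real code would concatenate Arrow RecordBatches.

fn coalesce(batches: Vec<Vec<i64>>, target: usize) -> Vec<Vec<i64>> {
    let mut out: Vec<Vec<i64>> = Vec::new();
    let mut current: Vec<i64> = Vec::new();
    for batch in batches {
        current.extend(batch);
        // Emit a batch once we have accumulated enough rows.
        if current.len() >= target {
            out.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        out.push(current); // flush the partial tail batch
    }
    out
}

fn main() {
    // Twenty one-row batches, as in the join output described above...
    let tiny: Vec<Vec<i64>> = (0..20).map(|i| vec![i]).collect();
    // ...coalesce into batches of up to 64 rows: one 20-row batch remains.
    let coalesced = coalesce(tiny, 64);
    assert_eq!(coalesced.len(), 1);
    assert_eq!(coalesced[0].len(), 20);
    println!("{} batch(es)", coalesced.len()); // prints "1 batch(es)"
}
```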
When I compare just running plain scans pulling columns from our TableProviders, their performance is comparable to MemTable.
Any ideas on how to figure out what I'm doing wrong here?