Update Modin join benchmark to current state #162
gshimansky wants to merge 2 commits into h2oai:master
Conversation
Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>
Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>
Thank you for contributing this script. I am now running the Modin join benchmark and will report back when it finishes.
Below I am presenting timings made on this PR (precisely speaking, on https://github.com/h2oai/db-benchmark/tree/modin-join-dev). The obvious observation is that there is a problem with the performance of join question 5, the big-to-big join: 1e7 rows joined to 1e7 rows, 1e8 to 1e8, 1e9 to 1e9. That is quite a common problem for software that works in a distributed manner; you may find this video interesting: https://www.youtube.com/watch?v=5X7h1rZGVs0

Another thing, more disturbing actually, are the timing values reported for the answer check step: we generally expect this value to be very low, much lower than the query time itself.

1e7: Timings for all 5 questions. All join queries successfully finished in 1859s.

1e8: When trying to do the first run of q5, python is being killed. Timings of q1-q4.

1e9: In the case of 1e9 rows of data, the script already fails during data loading. Unless Modin can handle out-of-memory data this is expected. If Modin is able to handle out-of-memory data (does it?), then we should enable that just for the 1e9 data size.
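For reference, a minimal sketch of what question 5 and the answer check look like when written against Modin. This is not the benchmark's exact code: the file paths, table names, and timing scaffolding are simplified assumptions, following the db-benchmark naming conventions (id3 join key, v1/v2 value columns).

```python
import timeit
import modin.pandas as pd

# Hypothetical 1e7-row input files, named after the db-benchmark convention.
x   = pd.read_csv("J1_1e7_NA_0_0.csv")    # left table, 1e7 rows
big = pd.read_csv("J1_1e7_1e7_0_0.csv")   # right table of the same size

# q5: big-to-big inner join on a single key.
t0 = timeit.default_timer()
ans = x.merge(big, on="id3")
query_time = timeit.default_timer() - t0

# Answer check: sum a couple of result columns. This step is expected to be
# much cheaper than the join itself.
t0 = timeit.default_timer()
chk = [ans["v1"].sum(), ans["v2"].sum()]
chk_time = timeit.default_timer() - t0

print(ans.shape, query_time, chk_time)
```

If the check timing comes out anywhere near the query timing, that is the disturbing behavior described above.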
I checked with Modin developer @YarShev, who knows the details of the merge operation, and we don't have any lazy computation for it. Performance there is a subject for investigation because I see these problems too, but we haven't figured out the reason for this behavior yet. As for memory, it looks like no configurations are able to pass.
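For what it's worth, one knob that can be tried when looking for a configuration that survives the larger joins is to initialize Ray explicitly before the first Modin import, so the object store size is under our control; Modin reuses an already-running Ray instance. This is only a hedged sketch, and the memory figure below is purely illustrative, not a recommendation.

```python
import ray

# Start Ray ourselves so we control the object store size; Modin will pick up
# the running instance instead of starting its own with default settings.
ray.init(object_store_memory=200 * 1024**3)  # ~200 GB, illustrative value only

import modin.pandas as pd  # imported after ray.init so the settings apply
```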
I updated the Modin implementation of the join benchmark to the current state. Most of the code is copied from the pandas version, but there are some differences.
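To illustrate the "mostly copied from the pandas version" point, here is a rough sketch of the pattern, not the exact script: the query code stays the same as in the pandas benchmark, and only the import changes. File names follow the db-benchmark J1_* convention but are used here purely as an example.

```python
import modin.pandas as pd   # the pandas script does: import pandas as pd

x     = pd.read_csv("J1_1e7_NA_0_0.csv")    # left-hand table
small = pd.read_csv("J1_1e7_1e1_0_0.csv")   # small right-hand table

# q1: join to the small table on a single integer key.
ans = x.merge(small, on="id1")
print(ans.shape)
```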