PERF-#7657: Fork pandas eval and query implementation to improve performance.#7658
Conversation
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
There was a problem hiding this comment.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
sfc-gh-joshi
left a comment
There was a problem hiding this comment.
One question regarding the license for forked code. Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
sfc-gh-mvashishtha
left a comment
There was a problem hiding this comment.
Also, how much of the pandas code did you need to change besides pointing import paths to modin equivalents of pandas modules? If it was very little then we may want to make clearer from folder naming that this is essentially vendored pandas code.
@sfc-gh-joshi I did make very few changes, but I don't think it's that important to point out with the directory structure that some code has been vendored from pandas. If we were to vendor an entire package I think it would make sense to put it in a new vendored directory. We are also putting some modified code in dataframe.py and other modified code in modin/core/computation/.
In that case, as long as all relevant files have something we can grep for if we want to pull in upstream changes it should be fine. |
|
@devin-petersohn could you PTAL at the licensing changes? Thanks! |
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
Currently we use the pandas eval() and query() implementations almost entirely as is. That's not good practice in general, and #7657 shows a performance issue that applies to Modin but not pandas in the current implementation.
In this commit, fork the query() and eval() implementation and eliminate the
.valuescall that causes numpy materialization.The code here is mostly copied from pandas/pandas/core/computation, except:
.valuescall that causes the performance issue in PERF: Fork pandas eval() and query() implementation to reduce to_numpy() calls #7657.Resolves #7657