SNOW-1830033: Dataframe API improvements #2811
Closed
sfc-gh-aalam wants to merge 2264 commits into main from
Conversation
Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

2 similar comments
1. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
2. Please describe how your code solves the related issue.

   Uses the bazel fat-jar target and copies over only the unparser fat jar instead of all runtime files. Before this change, ~0.5 GB of data had to be copied every time a user invoked `./scripts/copy-remote-ast.sh`. After this change, only the fat jar (~35 MB) is copied over. PR to be merged as a follow-up to #2878.
…o udf/udtf (#2900)

1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1891026

2. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   SnowML reported an issue where, in a multi-session context, the session parameter was not passed down to the UDF correctly. A similar issue was identified in `pandas_udtf`. Fixed for both.
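The bug class described above can be illustrated with a minimal sketch. This is not the actual Snowpark registration code; the function names, the global default session, and the string "sessions" are all stand-ins for illustration only.

```python
# Hypothetical sketch (not the actual Snowpark implementation) of the bug
# described above: a registration helper that silently falls back to a
# process-wide "active" session instead of honoring the session it was given.

_active_session = "session_A"  # stand-in for a global default session

def register_udf_buggy(func, session=None):
    # Bug: ignores the caller's session and always uses the global default.
    return (func.__name__, _active_session)

def register_udf_fixed(func, session=None):
    # Fix: prefer the explicitly supplied session; fall back only when absent.
    chosen = session if session is not None else _active_session
    return (func.__name__, chosen)

def double(x):
    return x * 2

# In a multi-session setup, session_B's registration must not leak to session_A.
assert register_udf_buggy(double, session="session_B") == ("double", "session_A")
assert register_udf_fixed(double, session="session_B") == ("double", "session_B")
```

The fix amounts to threading the caller's session through explicitly rather than re-resolving a global default deep inside the registration path.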
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1886560

2. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [x] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Using a new test account (SNOWPARK_PANDAS_QA6_TEST) for the Jenkins QA test due to frequent MFA issues. The QA job run https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkPythonSnowPandasDailyRegressRunner/197451/ no longer fails with `error_value = {'done_format_msg': False, 'errno': 250001, 'msg': 'Failed to connect to DB: notebook_chung.qa6.us-west-2.aws.snowflakecomputing.com:443. User is locked from Duo Security. Contact your local system administrator.', 'sqlstate': '08001'}`

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1890485

2. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   For any entity encoded in the AST, `with_src_position` is typically called, which stores the filename of the origin where a Snowpark API is called. There are usually only a few distinct files, but many API calls happen within a single file. Encoding the full file path for every entity leads to a bloated base64 message. To reduce the size, the file location is instead encoded as an integer that serves as a lookup into an interned value table. The IR DSL compiler was previously extended to provide functionality for interning of values. The IR modifications to intern the filename are done in https://github.com/snowflakedb/snowflake/tree/ls-SNOW-1890485.
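The interning scheme described above can be sketched in a few lines. This is an illustrative model only, not the Snowpark IR implementation; the class name and the example file paths are made up.

```python
# A minimal sketch of value interning: each distinct filename is stored once
# in a table, and AST entities carry only a small integer index instead of
# repeating the full path in every encoded entity.

class InternTable:
    def __init__(self):
        self._ids = {}     # value -> id
        self._values = []  # id -> value

    def intern(self, value):
        """Return a stable integer id for value, assigning one on first use."""
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def lookup(self, idx):
        """Recover the original value from its interned id."""
        return self._values[idx]

table = InternTable()
a = table.intern("/repo/pipelines/etl_job.py")  # hypothetical path
b = table.intern("/repo/pipelines/etl_job.py")  # repeated call, same file
c = table.intern("/repo/notebooks/analysis.py")
assert a == b == 0 and c == 1
assert table.lookup(0) == "/repo/pipelines/etl_job.py"
```

With many API calls per file, each call then costs one small integer in the message rather than one full path string.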
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1229442

2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   A single `df.quantile` call with large data + many quantiles previously emitted 81 queries, many of which were duplicate temp table operations. After #2090, this number was reduced to 6, so we can remove the `pytest.skip` on the offending test.
…#2895)

Please describe how your code solves the related issue.

The Snowpark pandas daily tests started running really slowly, especially on macOS (which now goes beyond 4 hours). In this PR we do the following:
1) Increase the macOS parallelism to 6; macOS is a small instance, and going beyond 6 causes problems. It now takes below 3 hours to finish.
2) Use regular GitHub parallelism for the large window and Linux instances. Each action now finishes within 1 hour.
3) Only run one test instance for optimizations.
4) Further move our daily tests away from working hours.
Please describe how your code solves the related issue.

test_telemetry_func_call_count is failing with error
```
2025-01-23T06:42:31.0419150Z FAILED tests/integ/modin/test_telemetry.py::test_telemetry_func_call_count - IndexError: list index out of range
```
It seems `DataFrame.__repr__` occurred once in the resulting telemetry data. Turn this test off for now.
Please describe how your code solves the related issue.

Turn off a couple of time-consuming Snowpark pandas tests. The tests turned off took minutes to finish on GitHub; verified with Jenkins run https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkPythonSnowPandasDailyRegressRunner/197458/
Reduce pytest concurrency from 48 to 36
…vironment (#2879)

SNOW-1886694 - Track whether snowpark is running in an interactive environment

In order to improve interactive performance we need to track it, particularly for Snowpark pandas. We use `sys.ps1` and `sys.flags.interactive` for most runtime environments, but for snowbook we also look for a module's availability. Snowbook notebooks can also run non-interactively, but snowbook does not set `sys.flags.interactive` right now, so this might over-count for snowbook until the flags are set properly.

Pre-review checklist:
- [x] I am adding a new automated test(s) to verify correctness of my new code
- [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- [ ] I am adding new logging messages
- [x] I am adding a new telemetry message
- [ ] I am adding new credentials
- [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
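The detection heuristic described above can be sketched as follows. The function name and the `snowbook` module probe are assumptions for illustration; the `sys.ps1` and `sys.flags.interactive` checks are the documented CPython signals for interactive use.

```python
import sys

# Minimal sketch of the interactive-environment heuristic described above.
# The function name and the snowbook module probe are illustrative assumptions.

def is_interactive_environment() -> bool:
    # A REPL defines sys.ps1; running `python -i` sets sys.flags.interactive.
    if hasattr(sys, "ps1") or sys.flags.interactive:
        return True
    # Notebook-style environments may set neither flag, so fall back to
    # probing for an environment-specific module (hypothetical name here).
    try:
        import snowbook  # noqa: F401  -- assumed module name
        return True
    except ImportError:
        return False

print(is_interactive_environment())
```

As the PR notes, a probe like this over-counts for environments that can also run non-interactively but never clear the signal being probed.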
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1875145

2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.
…y exists before scping the JAR (#2931)

1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes NOSNOW

2. Fill out the following pre-review checklist:
   - [ ] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Just an internal script change... No impact on anything in prod.
…#2913)

1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes SNOW-1890997

2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.
Further reduce some slow test suites
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   Fixes [SNOW-1893939](https://snowflakecomputing.atlassian.net/browse/SNOW-1893939)

2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [ ] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Added support for Index.drop_duplicates.

[SNOW-1893939]: https://snowflakecomputing.atlassian.net/browse/SNOW-1893939?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
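For reference, `Index.drop_duplicates` keeps the first occurrence of each label and preserves order. A pure-Python sketch of that semantic (the Snowpark pandas implementation instead pushes this down into a SQL query):

```python
# Order-preserving deduplication: keep the first occurrence of each label.
# Illustrative sketch only, not the Snowpark pandas implementation.

def drop_duplicates(labels):
    seen = set()
    result = []
    for label in labels:
        if label not in seen:
            seen.add(label)
            result.append(label)
    return result

assert drop_duplicates(["a", "b", "a", "c", "b"]) == ["a", "b", "c"]
```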
…ailure (#2941)

test_args_and_kwargs is the only test that is flaky and failing now; turn it off to mitigate the merge gate issue. Investigation tracked by ticket https://snowflakecomputing.atlassian.net/browse/SNOW-1896426
…param after a bad merge (#3236)
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
No JIRA ticket, but required for Phase 0 Client Closeout.
2. Fill out the following pre-review checklist:
- [x] I am adding a new automated test(s) to verify correctness of my new code
- [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- [ ] I am adding new logging messages
- [ ] I am adding a new telemetry message
- [ ] I am adding new credentials
- [ ] I am adding a new dependency
- [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
- [x] If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: [AST Support Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#ast-abstract-syntax-tree-support-in-snowpark)
3. Please describe how your code solves the related issue.
Due to widespread internal use of the `functions.builtin` method (an alias for `functions.function`), as well as `functions.call_builtin` (an alias for `functions.call_function`), we had set up the propagation of the AST directly through these methods down to the eventual call to the internal helper `_call_function`.
As a result, despite these being public APIs which are required to support Snowflake SQL functions not explicitly supported in Snowpark, we could not capture an AST to represent `call_function("foo", *args)` when no such `functions.foo` method exists in Snowpark.
Public APIs should NOT accept `_ast` parameters to route generated ASTs through them, as this complicates the scenario in which `_ast` is not provided but `_emit_ast` is `True`. If such a public API is used extensively internally, we then have an unresolvable question: whether to route `_ast=None` to the eventual internal helper (e.g. `_call_function`), which **will** eventually perform AST generation based on the other arguments, or whether to generate the public API's own AST and pass it to the internal helper function.
The only resolutions to the above issue are
- To add **another** parameter to the extensively internally used public API to signal whether to pass down the provided input `_ast` parameter (regardless of whether it is `None` or not), or whether to confidently generate its own AST and assign it to the result.
- To avoid the use of public APIs internally entirely.
This PR takes the latter approach. To motivate this refactor, I argue that using public APIs internally as helper functions is bad practice. Either we extract the common functionality into a separate internal helper function to be used by all public APIs which rely on it, or (especially in this case), if all calls are eventually routed into the same internal helper (like `_call_function`), we should avoid adding unnecessary wrapper calls to the stack internally and call that helper directly.
The refactor is quite simple overall but `functions.py` is large:
- Replace all internal uses of `builtin/function` and `call_builtin/call_function` with direct calls to `_call_function`
- Simplify AST generation wherever possible by using `build_function_expr`
- Remove `_ast` parameters from public APIs and leave the assignment of the AST to the result as the caller's responsibility.
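The resulting shape can be sketched as follows. The names mirror the PR's description (`_call_function`, `call_function`, `call_builtin`), but the bodies are illustrative stand-ins: ASTs are modeled as plain strings, and the real functions return Snowpark `Column` objects rather than dicts.

```python
# Simplified sketch of the refactor pattern described above. Only the internal
# helper knows how to generate an AST; public wrappers carry no `_ast` param.

def _call_function(name, *args, _emit_ast=True, _ast=None):
    # Internal helper: generates the AST itself when one is not supplied.
    ast = _ast if _ast is not None else f"Call({name}, {args})"
    return {"name": name, "args": args, "ast": ast if _emit_ast else None}

# Public APIs are thin wrappers; internal code now calls _call_function
# directly instead of routing through these wrappers.
def call_function(name, *args, _emit_ast=True):
    return _call_function(name, *args, _emit_ast=_emit_ast)

def call_builtin(name, *args, _emit_ast=True):  # alias kept for compatibility
    return call_function(name, *args, _emit_ast=_emit_ast)

result = call_function("foo", 1, 2)
assert result["ast"] == "Call(foo, (1, 2))"
assert call_builtin("foo", 1, _emit_ast=False)["ast"] is None
```

Because the wrapper never forwards an `_ast`, there is no ambiguity about whose AST wins: the helper always generates it from the arguments it receives, and callers that need a custom AST attach it to the result themselves.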
…s is not enabled (#3240)
Co-authored-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
…stored procedure (#3246)
…ns in OrderedDataFrame (#3144) Co-authored-by: Aaryam Maheshwary <aaryam.maheshwary@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   [SNOW-1997136](https://snowflakecomputing.atlassian.net/browse/SNOW-1997136): Ensure transitive closure in AST Request messages.

2. Fill out the following pre-review checklist:
   - [x] I am adding a new automated test(s) to verify correctness of my new code
   - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
   - [x] If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support.

3. Please describe how your code solves the related issue.

   This PR introduces an AST handler class `AstBuilder` which will be carried by all Snowpark objects that need to generate requests to be sent to the server side (`Table`, `Dataframe`, `UDxF/Sproc` registration, etc.). Requesting a review of just the implementation for now; the refactor to remove `Session._ast_batch` and use this handler class across the client will be done in a separate PR. The dependency graph between `AstBuilder` instances will be created via the `_set_ast_ref` helpers we have introduced to Snowpark classes, via the `add_required_by` or `add_dependency` methods in the new `AstBuilder` class.

[SNOW-1997136]: https://snowflakecomputing.atlassian.net/browse/SNOW-1997136?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
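The dependency-tracking idea behind the builder class can be sketched minimally. The method names `add_dependency`/`add_required_by` come from the PR text; everything else here (fields, the traversal, the example objects) is an illustrative assumption, not the actual implementation.

```python
# Minimal sketch of a dependency graph between AST builders, enough to compute
# the transitive closure of statements a request must include.

class AstBuilder:
    def __init__(self, name):
        self.name = name
        self.dependencies = set()  # builders this one requires
        self.required_by = set()   # builders that require this one

    def add_dependency(self, other):
        self.dependencies.add(other)
        other.required_by.add(self)

    def add_required_by(self, other):
        other.add_dependency(self)

    def transitive_closure(self):
        """All builders needed, directly or indirectly, to serialize self."""
        seen, stack = set(), [self]
        while stack:
            node = stack.pop()
            for dep in node.dependencies:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

table = AstBuilder("Table")
df = AstBuilder("DataFrame")
udf = AstBuilder("UDF")
df.add_dependency(table)  # the DataFrame's request needs the Table's AST
udf.add_dependency(df)
assert {b.name for b in udf.transitive_closure()} == {"DataFrame", "Table"}
```

A request built for any object can then walk its closure to guarantee every AST it references is also shipped to the server.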
Co-authored-by: Hemit Shah <hemit.shah@snowflake.com>
🎉 Snyk checks have passed. No issues have been found so far.
cacb1bb to 611a768
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1830033
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Please write a short description of how your code change solves the related issue.