SNOW-1830033: Dataframe API improvements #2811

Closed
sfc-gh-aalam wants to merge 2264 commits into main from SNOW-1830033-dataframe-api-bcrs

Conversation


sfc-gh-aalam (Contributor) commented Dec 25, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1830033

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    Please write a short description of how your code change solves the related issue.

@github-actions

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

2 similar comments from the github-actions bot (most recent Jan 3, 2025).

sfc-gh-lspiegelberg and others added 27 commits January 22, 2025 01:35
2. Fill out the following pre-review checklist:

   - [ ] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Uses the Bazel fat-jar target and copies over only the unparser fat jar
instead of all runtime files. Before this change, ~0.5 GB of data had to
be copied every time a user invoked `./scripts/copy-remote-ast.sh`. After
this change, only the fat jar (~35 MB) is copied over. PR to be merged as
a follow-up to #2878.
…o udf/udtf (#2900)


1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes SNOW-1891026

2. Fill out the following pre-review checklist:

   - [ ] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

SnowML reported an issue where, in a multi-session context, the session
parameter was not passed down to the UDF correctly. A similar issue was
identified in `pandas_udtf`. Both are fixed.
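A minimal sketch of the bug pattern and fix (helper names are hypothetical; this is not Snowpark's actual registration code):

```python
# Hypothetical sketch of the multi-session bug: a registration helper
# that drops the caller's explicit session and always uses the global
# default, vs. the fixed version that honors the passed-in session.

_active_session = "session-A"  # stand-in for a global default session

def register_udf_buggy(func, session=None):
    # Bug: the explicit `session` argument is silently ignored.
    return (func.__name__, _active_session)

def register_udf_fixed(func, session=None):
    # Fix: prefer the session the caller passed in; only fall back
    # to the global default when none was given.
    return (func.__name__, session if session is not None else _active_session)

def add_one(x):
    return x + 1
```

With two sessions alive, the buggy helper registers every UDF against the default session, which is the kind of failure SnowML observed.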

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes SNOW-1886560

2. Fill out the following pre-review checklist:

   - [ ] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [x] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.
Use a new test account (SNOWPARK_PANDAS_QA6_TEST) for the Jenkins QA
test due to frequent MFA issues.
QA job run:
https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkPythonSnowPandasDailyRegressRunner/197451/
no longer fails with `error_value = {'done_format_msg': False, 'errno':
250001, 'msg': 'Failed to connect to DB:
notebook_chung.qa6.us-west-2.aws.snowflakecomputing.com:443. User is
locked from Duo Security. Contact your local system administrator.',
'sqlstate': '08001'}`

Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

   Fixes SNOW-1890485

2. Fill out the following pre-review checklist:

   - [ ] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

For any entity encoded in the AST, `with_src_position` is typically
called, which stores the filename of the origin where a Snowpark API is
called. There are usually only a few distinct files, but many API calls
happen within a single file. Encoding the full file path for every
entity leads to a bloated base64 message. To reduce the size, the file
location is instead encoded as an integer that represents a lookup into
an interned value table. The IR DSL compiler was previously extended
with functionality for interning values. Modifications for the IR to
intern the filename are done in
https://github.com/snowflakedb/snowflake/tree/ls-SNOW-1890485.
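The interning scheme can be sketched as follows (illustrative class, not the IR DSL's actual API):

```python
# Hypothetical sketch of value interning: each distinct filename is
# stored once in a table, and AST nodes reference it by a small
# integer id instead of repeating the full path.

class InternTable:
    def __init__(self):
        self._ids = {}     # value -> id
        self._values = []  # id -> value

    def intern(self, value):
        # Return the existing id, or assign the next available one.
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def lookup(self, idx):
        return self._values[idx]

table = InternTable()
# Many API calls from the same file all share one small integer
# instead of each carrying the full path in the encoded message.
ids = [table.intern("/repo/src/pipeline.py") for _ in range(1000)]
```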

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes SNOW-1229442

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

A single `df.quantile` call with large data + many quantiles previously
emitted 81 queries, many of which were duplicate temp table operations.
After #2090, this number was reduced to 6, so we can remove the
`pytest.skip` on the offending test.
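The flavor of this reduction can be sketched with a cache that deduplicates identical statements (illustrative only, not Snowpark's actual query pipeline):

```python
# Hypothetical sketch: an executor that issues each distinct SQL
# statement once and reuses the result handle for duplicates, so
# repeated temp-table operations no longer multiply the query count.

class DedupExecutor:
    def __init__(self):
        self.issued = []   # statements actually sent to the server
        self._cache = {}   # statement -> result handle

    def execute(self, sql):
        if sql in self._cache:
            return self._cache[sql]  # reuse prior result handle
        self.issued.append(sql)
        handle = f"result-{len(self.issued)}"
        self._cache[sql] = handle
        return handle

ex = DedupExecutor()
# Many calls, but only the distinct statements actually run.
for q in ["CREATE TEMP TABLE t AS SELECT 1"] * 75 + ["SELECT 0.5"] * 6:
    ex.execute(q)
```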
…#2895)

Please describe how your code solves the related issue.
The Snowpark pandas daily tests started running very slowly, especially
on macOS (now beyond 4 hours). In this PR we do the following:
1) Increase macOS parallelism to 6; macOS is a small instance, and going
beyond 6 causes problems. The run now finishes in under 3 hours.
2) Use regular GitHub parallelism for the large Windows and Linux
instances. Each action now finishes within 1 hour.
3) Only run one test instance for optimizations.
4) Further move our daily tests away from working hours.
Please describe how your code solves the related issue.

test_telemetry_func_call_count is failing with error 
```
2025-01-23T06:42:31.0419150Z FAILED tests/integ/modin/test_telemetry.py::test_telemetry_func_call_count - IndexError: list index out of range
```
It seems `DataFrame.__repr__` occurred an extra time in the resulting
telemetry data. Turn this test off for now.
Please describe how your code solves the related issue.

Turn off a couple of time-consuming Snowpark pandas tests. The tests
turned off took many minutes to finish in GitHub CI.


verified with jenkins run
https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkPythonSnowPandasDailyRegressRunner/197458/
Reduce pytest concurrency from 48 to 36
…vironment (#2879)

SNOW-1886694 - Track whether Snowpark is running in an interactive
environment.
To improve interactive performance we need to track this, particularly
for Snowpark pandas. We use `sys.ps1` and `sys.flags.interactive` for
most runtime environments, but for Snowbook we also check for the
availability of a specific module. Snowbook notebooks can also run
non-interactively, but Snowbook does not set `sys.flags.interactive`
right now, so this might over-count for Snowbook until the flags are set
properly.
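A sketch of this detection logic, assuming standard CPython behavior; the module name `snowbook_runtime` is a hypothetical stand-in for the Snowbook-specific probe:

```python
# Hedged sketch of interactive-environment detection. CPython sets
# sys.ps1 in a REPL and sys.flags.interactive under `python -i`;
# notebook runtimes may set neither, so we fall back to probing for
# a runtime-specific module (hypothetical name below).
import importlib.util
import sys

def is_interactive():
    if hasattr(sys, "ps1"):        # REPL defines the sys.ps1 prompt
        return True
    if sys.flags.interactive:      # `python -i` sets this flag
        return True
    # Notebook fallback: check for a runtime-specific module.
    return importlib.util.find_spec("snowbook_runtime") is not None
```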

pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [x] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes SNOW-1875145

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Please write a short description of how your code change solves the
related issue.
…y exists before scping the JAR (#2931)

1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes NOSNOW

2. Fill out the following pre-review checklist:

   - [ ] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Just an internal script change... No impact on anything in prod.
…#2913)


1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.


   Fixes SNOW-1890997

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

Please write a short description of how your code change solves the
related issue.
Further reduce some slow test suites
1. Which Jira issue is this PR addressing? Make sure that there is an
accompanying issue to your PR.

Fixes [SNOW-1893939](https://snowflakecomputing.atlassian.net/browse/SNOW-1893939)

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [ ] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)

3. Please describe how your code solves the related issue.

   Added support for Index.drop_duplicates.

[SNOW-1893939]:
https://snowflakecomputing.atlassian.net/browse/SNOW-1893939?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
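The added semantics can be sketched as an order-preserving, keep-first deduplication (mirroring pandas' `Index.drop_duplicates` with its default `keep="first"`):

```python
# Minimal sketch of drop_duplicates semantics: keep the first
# occurrence of each value, preserving the original order.
def drop_duplicates(values):
    seen = set()
    result = []
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result
```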
…ailure (#2941)

test_args_and_kwargs is the only test that is currently flaky and
failing; turn it off to mitigate the merge-gate issue. Investigation is
tracked by ticket https://snowflakecomputing.atlassian.net/browse/SNOW-1896426
sfc-gh-helmeleegy and others added 26 commits April 4, 2025 16:03
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   No JIRA ticket, but required for Phase 0 Client Closeout.

2. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
   - [x] If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: [AST Support Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#ast-abstract-syntax-tree-support-in-snowpark)

3. Please describe how your code solves the related issue.

   Due to widespread internal use of the `functions.builtin` method (an alias for `functions.function`), as well as `functions.call_builtin` (an alias for `functions.call_function`), we had set up the propagation of the AST directly through these methods down to the eventual call to the internal helper `_call_function`.

   As a result, despite these being public APIs which are required to support Snowflake SQL functions not explicitly supported in Snowpark, we could not capture an AST to represent `call_function("foo", *args)` when no such `functions.foo` method exists in Snowpark.

   Public APIs should NOT be accepting `_ast` parameters to route generated ASTs through them as this complicates the scenario in which `_ast` is not provided but `_emit_ast` is `True`. If this Public API is used extensively internally, we then have an unresolvable question of whether to route `_ast=None` to the eventual internal helper (e.g. `_call_function`) which **will** eventually perform AST generation based on the other arguments, or whether to generate the Public API's own AST and pass it to the internal helper function.

   The only resolutions to the above issue are
     - To add **another** parameter to the extensively internally used public API to signal whether to pass down the provided input `_ast` parameter (regardless of whether it is `None` or not), or whether to confidently generate its own AST and assign it to the result.
     - To avoid the use of public APIs internally entirely.

   This PR takes the latter approach. To motivate this refactor, I argue that using public APIs internally as helper functions is bad practice. Either we extract the common functionality into a separate internal helper function to be used by all public APIs which rely on it, or (especially in this case) if all calls are eventually routed into the same internal helper (like `_call_function`), then we should avoid adding unnecessary wrapper calls to the stack internally and call this helper directly.

   The refactor is quite simple overall but `functions.py` is large:
    - Replace all internal uses of `builtin/function` and `call_builtin/call_function` with direct calls to `_call_function`
    - Simplify AST generation wherever possible by using `build_function_expr`
    - Remove `_ast` parameters from public APIs and leave the assignment of the AST to the result as the caller's responsibility.
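The shape of the refactor can be sketched as follows (illustrative bodies; only the names `call_function`, `call_builtin`, and `_call_function` come from the PR):

```python
# Hypothetical sketch: the public wrappers carry no `_ast` plumbing.
# The single internal helper builds its own AST (modeled here as a
# tuple) when asked, and internal code calls it directly instead of
# going through the public wrappers.

def _call_function(name, *args, _emit_ast=True):
    # Internal helper: owns AST generation based on its arguments.
    ast = ("call", name, args) if _emit_ast else None
    return {"name": name, "args": args, "ast": ast}

def call_function(name, *args):
    # Public API: a thin wrapper with no `_ast` parameter.
    return _call_function(name, *args)

call_builtin = call_function  # public alias, same behavior

# Internal callers skip the wrappers entirely and can suppress
# AST emission when the caller will attach its own AST.
internal_result = _call_function("foo", 1, 2, _emit_ast=False)
```

This avoids the unresolvable question in the text: since no public wrapper ever receives an `_ast`, there is no ambiguity about whether to route it through or regenerate it.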
Co-authored-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
Signed-off-by: Labanya Mukhopadhyay <labanya.mukhopadhyay@snowflake.com>
…ns in OrderedDataFrame (#3144)

Co-authored-by: Aaryam Maheshwary <aaryam.maheshwary@snowflake.com>
1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

   [SNOW-1997136](https://snowflakecomputing.atlassian.net/browse/SNOW-1997136): Ensure transitive closure in AST Request messages.

3. Fill out the following pre-review checklist:

   - [x] I am adding a new automated test(s) to verify correctness of my new code
      - [ ] If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
   - [ ] I am adding new logging messages
   - [ ] I am adding a new telemetry message
   - [ ] I am adding new credentials
   - [ ] I am adding a new dependency
   - [ ] If this is a new feature/behavior, I'm adding the Local Testing parity changes.
   - [x] I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: [Thread-safe Developer Guidelines](https://github.com/snowflakedb/snowpark-python/blob/main/CONTRIBUTING.md#thread-safe-development)
   - [x] If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support.

4. Please describe how your code solves the related issue.

   This PR introduces an AST handler class `AstBuilder` which will be carried by all Snowpark objects that need to generate requests to be sent to the server side (`Table`, `Dataframe`, `UDxF/Sproc` registration, etc).

    Requesting a review of just the implementation for now, the refactor to remove `Session._ast_batch` and use this handler class across the client will be done in a separate PR.

    The dependency graph between `AstBuilder` instances will be created via the `_set_ast_ref` helpers we have introduced to Snowpark classes via the `add_required_by` or `add_dependency` methods in the new `AstBuilder` class.

[SNOW-1997136]: https://snowflakecomputing.atlassian.net/browse/SNOW-1997136?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
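A sketch of the dependency-graph idea, using the `add_dependency`/`add_required_by` names from the PR (the class body is illustrative, not the real `AstBuilder`):

```python
# Hypothetical sketch: each Snowpark object carries a builder; edges
# between builders let a request include the transitive closure of
# everything it depends on.

class AstBuilder:
    def __init__(self, name):
        self.name = name
        self.dependencies = []  # builders this one needs in its request
        self.required_by = []   # builders that need this one

    def add_dependency(self, other):
        self.dependencies.append(other)
        other.required_by.append(self)

    def add_required_by(self, other):
        # `other` requires `self`: record the edge in both directions.
        other.add_dependency(self)

    def transitive_closure(self):
        # Every builder reachable through dependencies must be
        # included in the request sent to the server side.
        seen, stack = set(), [self]
        while stack:
            b = stack.pop()
            if b.name not in seen:
                seen.add(b.name)
                stack.extend(b.dependencies)
        return seen

df = AstBuilder("DataFrame")
tbl = AstBuilder("Table")
sess = AstBuilder("Session")
df.add_dependency(tbl)
tbl.add_dependency(sess)
```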
Co-authored-by: Hemit Shah <hemit.shah@snowflake.com>
@sfc-gh-snowflakedb-snyk-sa

sfc-gh-snowflakedb-snyk-sa commented Apr 15, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found.

license/snyk check is complete. No issues have been found.

sfc-gh-aalam force-pushed the SNOW-1830033-dataframe-api-bcrs branch from cacb1bb to 611a768 on November 12, 2025 23:20