Skip to content

SNOW-2057867 refactor BindUploadAgent to make it work for Python sprocs#2303

Merged
sfc-gh-zyao merged 10 commits intomainfrom
zyao-SNOW-2057867-refactor-stage-bind-to-make-it-work-for-sprocs
May 12, 2025
Merged

SNOW-2057867 refactor BindUploadAgent to make it work for Python sprocs#2303
sfc-gh-zyao merged 10 commits intomainfrom
zyao-SNOW-2057867-refactor-stage-bind-to-make-it-work-for-sprocs

Conversation

@sfc-gh-zyao
Copy link
Contributor

  • relax the condition of using stage solution for array bind (so we will use bind_size >= threshold instead of bind_size > threshold)
  • use _upload_stream to implement BindUploadAgent

Tests

  • the new test_direct_file_operation_utils.py to validate parse_file_operation for _upload and _upload_stream
  • existing test test_bindings.py to make sure the change does not break existing bind upload logic

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes #NNNN

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am modifying authorization mechanisms
    • I am adding new credentials
    • I am modifying OCSP code
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Please write a short description of how your code change solves the related issue.

  4. (Optional) PR for stored-proc connector:

- relax the condition of using stage solution for array bind (so we will use bind_size >= threshold instead of bind_size > threshold)
- use _upload_stream to implement BindUploadAgent

### Tests
- the new test_direct_file_operation_utils.py to validate parse_file_operation for _upload and _upload_stream
- existing test test_bindings.py to make sure the change does not break existing bind upload logic
@sfc-gh-zyao
Copy link
Contributor Author

sfc-gh-zyao commented Apr 28, 2025

Update: the following content is obsolete now, because we simplified this PR and removed these premature optimization in latest commit.

(@sfc-gh-sfan 's comment, copied from the old PR #2300 (comment)):

This does not look like a pure refactor. We do an assertion, a file name split and some escaping. These do not exist in the original code. Why are they required?
  • for the assertion: it is a fool-proof mechanism, to make sure when users are doing stream upload, they don't pass in a non-empty local_file_name. But maybe it is unnecessary? Because when users call _upload_stream, there is no way for them to provide a standalone local_file_name anyways, by the definition of the interface.
  • for the file name splitting: it is an existing thing for sprocs and we want to keep the same behavior here. (I provided more context down below)
  • for the escaping: (assuming you mean the escaping for stuff inside 'file://...') it is to avoid the risk of SQL injection without changing what is allowed in local file path. From a high level (forget about stage name biding for now), we pass SQL like PUT 'file://<some_local_file_path>' @stage_foo (note the quotes surrounding the entire local file path, as a SQL injection prevention strategy). And we want to make sure the quotes in user-provided local file path still works as intended, by doing the escape (user could have local file name like, /dir'foo/test.txt, without escaping this becomes 'file:///dir'foo/test.txt', which is incorrect)

Context:
_upload_stream method's behavior is that user passes in input_stream, stage_location as input, where we derive the file name from stage_location (We need a file name because stream itself does not contain file name, and we need a name for each file in stage)

For public connector, we have the same interface and we want to keep the same user journey. Note that server is not able to parse and derive the base file name for us. So we do need to do the split.

@sfc-gh-zyao
Copy link
Contributor Author

(@sfc-gh-sfan 's comment, copied from the old PR #2300 (comment))

Is this correct? The original function use PUT which is somewhat equivalent to _upload?

Yes, this (

self.cursor.execute(
f"PUT file://{row_idx}.csv {self.stage_path}", file_stream=f
)
) is the existing public connector implementation, note that the file_stream=f is passed to execute, which does stream uploading instead of file uploading

@sfc-gh-zyao
Copy link
Contributor Author

(@sfc-gh-sfan 's comment, copied from the old PR #2300 (comment))

nit: unintended change

Removed

Comment on lines 92 to 93
stage_location, unprefixed_local_file_name = stage_location.rsplit(
"/", maxsplit=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the file name splitting: it is an existing thing for sprocs and we want to keep the same behavior here. (I provided more context down below)

Is this path used in stored proc? I somehow had the impression we override parse_file_operation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your impression is right, we don't use this code in sproc execution. But we will still rsplit in public connector though, because the interface of _upload_stream looks like

def _upload_stream(
        self,
        input_stream: IO[bytes],
        stage_location: str,
        options: dict[str, Any],
        _do_reset: bool = True,
    ) -> None:

We need rsplit stage_location to get the file name

@sfc-gh-zyao
Copy link
Contributor Author

@sfc-gh-sfan FYI I have removed the premature optimization, to simplify this change

@sfc-gh-zyao sfc-gh-zyao added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md DO_NOT_PORT_CHANGES_TO_SP labels Apr 29, 2025
- We use TemporaryDirectory in this test. It's output path is single-back-slashed and could contain special characters. But `_upload` API expects normalized path. So we should do normalization and quote th path before passing it to _upload in test
- this has no effect for regular / non-sproc use cases because it is the default option
- sproc require it to be explicitly present, so we need it here (we will have a future server side change to make it optional as well for sproc)
@sfc-gh-zyao
Copy link
Contributor Author

Hi @sfc-gh-sfan commit a3c66cf is the extra fix that I mentioned earlier. Could you please take a look and let me know if it looks good to you

@sfc-gh-sfan
Copy link
Contributor

sfc-gh-sfan commented Apr 30, 2025

Could you please take a look and let me know if it looks good to you

LGTM

Copy link
Contributor

@sfc-gh-mmishchenko sfc-gh-mmishchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, besides that private methods/fields naming convention is violated now.


with self._connection.cursor() as cursor:
# Send constructed SQL to server and get back parsing result.
processed_params = cursor._connection._process_params_qmarks(params, cursor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both _connection field and _process_params_qmarks method so far have followed private _ naming convention. Probably some sort of name refactoring is good to have here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I guess your suggestion to do renaming like _process_params_qmarks -> process_params_qmarks ? I am a bit worried about confusing external customers, by making _process_params_qmarks public. Because normally they will directly use execute to run a SQL, without worrying about these binding param utils.

And here we are more or less using private / internal methods to implement it. Would it be less awkward, if I change to something like

processed_params = self._connection._process_params_qmarks(params, cursor)

(Such that it is more like we are accessing some internal fields of this current class, and therefore more natural)

Copy link
Contributor

@sfc-gh-mmishchenko sfc-gh-mmishchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@sfc-gh-zyao
Copy link
Contributor Author

sfc-gh-zyao commented May 12, 2025

The only failures in merge gate of "continuous integration" are

  • test.unit.test_oauth_token.test_oauth_code_successful_flow
  • test.unit.test_oauth_token.test_oauth_code_invalid_state
  • test.unit.test_oauth_token.test_oauth_code_token_request_error
  • test.unit.test_oauth_token.test_oauth_code_browser_timeout
  • test.unit.test_oauth_token.test_oauth_code_custom_urls
  • test.unit.test_oauth_token.test_oauth_code_successful_refresh_token_flow
  • test.unit.test_oauth_token.test_oauth_code_expired_refresh_token_flow
  • test.unit.test_oauth_token.test_client_creds_successful_flow
  • test.unit.test_oauth_token.test_client_creds_successful_refresh_token_flow
  • test.unit.test_oauth_token.test_client_creds_expired_refresh_token_flow

Those are irrelevant to this change

@sfc-gh-zyao sfc-gh-zyao merged commit 0d79989 into main May 12, 2025
94 of 95 checks passed
@sfc-gh-zyao sfc-gh-zyao deleted the zyao-SNOW-2057867-refactor-stage-bind-to-make-it-work-for-sprocs branch May 12, 2025 22:45
@github-actions github-actions bot locked and limited conversation to collaborators May 12, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants