Skip to content

Comments

Load s3 config from env and directories using aws-config crate#203

Merged
kylebarron merged 5 commits intomainfrom
kyle/s3-config-credentials
Feb 5, 2025
Merged

Load s3 config from env and directories using aws-config crate#203
kylebarron merged 5 commits intomainfrom
kyle/s3-config-credentials

Conversation

@kylebarron
Copy link
Member

@kylebarron kylebarron commented Feb 3, 2025

Explore using aws-config

Other prior work:

cc @ion-elgreco since you've already discussed this in delta-rs

Closes #202

Todo:

  • Make note that this will not pickle the credential provider

Copy link

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you checked what the binary size is after adding this?

If these AWS crates add some size you could potentially also make a credential provider that checks if boto3 is available.

I noticed our AWS integration was like 10% of our binary, albeit it contains more than - just the AWS crates, but worth checking though :)

///
/// Downstream consumers may explicitly want to depend on tokio and add `rt-multi-thread` as a
/// tokio feature flag to opt-in to the multi-threaded tokio runtime.
pub fn get_runtime(py: Python<'_>) -> PyResult<Arc<Runtime>> {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to reuse our runtime init: https://github.com/delta-io/delta-rs/blob/9dc8f328859b4fe66fa82a636aae662cee22db30/python/src/utils.rs#L11

It has a more graceful error for forked processes

Copy link
Member Author

@kylebarron kylebarron Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand forked processes but I'll take your word for it and use that 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""

@classmethod
def from_aws_defaults(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that you separated and have to explicitly call it.

We have it disabled when you provide a custom endpoint because that generally means you are not on AWS, the AWS SDK will throw quite some warnings and try to resolve things with no success if you are not on aws..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any naming suggestions here?

from_native_credentials

maybe?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep aws at least in the wording since it's an AWS only feature.

Maybe from_aws_native or from_aws_credentials

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... yes, but the S3 part of S3Store.from_native_credentials should already make clear that it's specific to AWS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from_native_credentials is too long of a name though

@kylebarron
Copy link
Member Author

kylebarron commented Feb 3, 2025

Have you checked what the binary size is after adding this?

(Compressed) wheel goes from 3.4MB to 4.8MB on macOS. I'm inclined to say that's ok if it significantly simplifies AWS credentials

@kylebarron
Copy link
Member Author

potentially also make a credential provider that checks if boto3 is available

We do have boto3.session integration already by the way: https://developmentseed.org/obstore/latest/api/store/aws/#obstore.store.S3Store.from_session

But it only passes in frozen credentials; it won't refresh the credentials. And I imagine having the credential provider go through Python would be slow.

@ion-elgreco
Copy link

potentially also make a credential provider that checks if boto3 is available

We do have boto3.session integration already by the way: https://developmentseed.org/obstore/latest/api/store/aws/#obstore.store.S3Store.from_session

But it only passes in frozen credentials; it won't refresh the credentials. And I imagine having the credential provider go through Python would be slow.

Right, I missed that :)

@ion-elgreco
Copy link

Have you checked what the binary size is after adding this?

(Compressed) wheel goes from 3.4MB to 4.8MB on macOS. I'm inclined to say that's ok if it significantly simplifies AWS credentials

Yeah up to you :) just wanted to highlight it only in case you wanted to keep the binary size very small

@ion-elgreco
Copy link

ion-elgreco commented Feb 4, 2025

Another thought, maybe you can use a custom env var (USE_AWS_NATIVE_CREDENRIALS) to force AWS native credentials resolving in the init of S3Store

This could be useful in setups where only init is called on the S3Store by an external library, however you want it to pick up the native credentials.

@kylebarron
Copy link
Member Author

(Compressed) wheel goes from 3.4MB to 4.8MB on macOS. I'm inclined to say that's ok if it significantly simplifies AWS credentials

Yeah up to you :) just wanted to highlight it only in case you wanted to keep the binary size very small

All else equal of course I want the binary size to stay small. It would be ideal if we could optionally distribute the heightened AWS support, but that's hard to do.

If ObjectStore were FFI stable we could distribute s3 bindings in a separate Python namespace module, but that's not really feasible today.

And given that S3 is the most popular cloud this extra 1.4MB seems worth it?

Another thought, maybe you can use a custom env var (USE_AWS_NATIVE_CREDENRIALS) to force AWS native credentials resolving in the init of S3Store

This could be useful in setups where only init is called on the S3Store by an external library, however you want it to pick up the native credentials.

That sounds much too magical, IMO. A better approach would be for external libraries to (at least optionally) accept a store: ObjectStore parameter, so that users can first construct the store with whatever parameters they care about. That also means users will get much better static typing when constructing the store, compared to kwargs that get passed down internally.

@ion-elgreco
Copy link

ion-elgreco commented Feb 4, 2025

Perhaps the credentials provider can be in a separate library using ffi instead of the full S3 implementation.

I do see the need for ffi objectstore, also today someone asked about it in polars discord.

Regarding the auto credentials loading when an environment variable is set, it is bit of auto magic but you have to consider some use cases where users don't provide a store themselves.

Take this dagster-obstore thing I wrote, users configure the store through a yaml configuration. If they can toggle credentials loading through an env var its quite useful

@kylebarron
Copy link
Member Author

Perhaps the credentials provider can be in a separate library using ffi instead of the full S3 implementation.

I haven't designed an FFI implementation from scratch myself (e.g. I've only reused Arrow's), so while I'm open in theory to using FFI for the credentials provider it's not something I want to put my own time into doing right now (though an issue laying out a path forward would be welcome for discussion).

I'm inclined to provisionally merge this and at some point in the future we could refactor to support external credentials providers.

Take this dagster-obstore thing I wrote, users configure the store through a yaml configuration. If they can toggle credentials loading through an env var its quite useful

Right... I see in principle the benefit of passing in config via env vars, but I'm very wary of signing up for too much of a maintenance burden too quickly. Especially because I believe it's possible this library could get quite popular and heavily used. So for stuff like that I think it's better to wait and see how many people chime in on the issues asking for it, rather than moving too quickly and being stuck.

In dagster-obstore it would be possible for you to check if USE_AWS_NATIVE_CREDENTIALS is set, and switch on whether you call __init__ or from_native_credentials, right?

@kylebarron kylebarron marked this pull request as ready for review February 5, 2025 00:03
@kylebarron
Copy link
Member Author

Have you checked what the binary size is after adding this?

(Compressed) wheel goes from 3.4MB to 4.8MB on macOS. I'm inclined to say that's ok if it significantly simplifies AWS credentials

Yeah up to you :) just wanted to highlight it only in case you wanted to keep the binary size very small

Well pip install -t . s3fs takes up ~35MB of space on MacOS, so an extra 1.4 MB I'm ok with!

@kylebarron
Copy link
Member Author

I renamed the constructor to

_from_native

and added some documentation to call this provisionally supported. I'd like to get more feedback (including naming) before committing to a specific API, but we can include it in a release still!

@kylebarron kylebarron merged commit 5580311 into main Feb 5, 2025
4 checks passed
@kylebarron kylebarron deleted the kyle/s3-config-credentials branch February 5, 2025 00:06
kylebarron added a commit that referenced this pull request Feb 27, 2025
* Remove `S3Store._from_native`

* Remove aws-config dependency

* update changelog

* fix link
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use AWS Rust SDK To Source Credentials For S3

2 participants