Skip to content

add singleton to track bucket/workspace information in scala#443

Open
jdries wants to merge 1 commit intodevelopfrom
workspacerepo
Open

add singleton to track bucket/workspace information in scala#443
jdries wants to merge 1 commit intodevelopfrom
workspacerepo

Conversation

@jdries
Copy link
Contributor

@jdries jdries commented May 7, 2025

Copy link
Contributor

@pvbouwel pvbouwel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of the PR but implementation wise there are some unknowns and I would have some things different

val isCloudFerro = s3Endpoint != null &&
(s3Endpoint.toLowerCase.contains("cloudferro") || s3Endpoint.toLowerCase.endsWith(".dataspace.copernicus.eu"))

val maybeWorkspace = WorkspaceRepository.get().getWorkspaceByBucket(s3Uri.getBucket)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this logic is best placed here. CreoS3Utils has helpers like getCreoS3Client so if we solve the problem here in MultiClientRangeReader we'd have to solve it in other places as well. Ideally there is one factory for creating S3 clients. There are however multiple problems with getCreoS3Client and it's consumption:

  1. It assumes the region is known or defaults to 'RegionOne'
    a. a hard coded default does not make sense imho
    b. the placeholder value would cause the endpoint to be resolved via an environment variable SWIFT_URL but it would not change the region. If sigv4 checking is done stricly then it would lead to authorization failures.
  2. From a consumption part it often called in the same file without specifying an argument

I like the idea of a Workspace Repository. Python could provision the workspaces info that is potentially encountered.

So a ClientFactory method that just takes a bucket name (or S3URI from which it can extract the bucket name) use the workspace repository to resolve bucket to S3 details (region + endpoint or region+profile) and potentially falls back to the legacy resolution would be usable in both places.

*/

private val workspaces = scala.collection.mutable.Map[String, WorkspaceConfig]()
private val workspacesByBucket = scala.collection.mutable.Map[String, WorkspaceConfig]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular reason not to just have a Map[String, String]. A 1-to-1 relationship between workspace and bucketname is by doing 2 lookups the WorkspaceConfig won't be duplicated.

private val workspaces = scala.collection.mutable.Map[String, WorkspaceConfig]()
private val workspacesByBucket = scala.collection.mutable.Map[String, WorkspaceConfig]()

def registerBucketDetails( workspaceId:String, bucketName: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will call these register functions? I guess the singleton is per JVM so needs to be called on driver and all executors?

For K8s we can easily make sure a file is present in the execution environment. Is the same true for YARN? Otherwise the loading can be done on initialization of the singleton.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are sync jobs a thing on YARN because? Because for jobs I guess at spark-submit time extra files can be handed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants