add singleton to track bucket/workspace information in scala#443
pvbouwel
left a comment
I like the idea of the PR, but implementation-wise there are some unknowns, and I would do some things differently.
val isCloudFerro = s3Endpoint != null &&
  (s3Endpoint.toLowerCase.contains("cloudferro") || s3Endpoint.toLowerCase.endsWith(".dataspace.copernicus.eu"))

val maybeWorkspace = WorkspaceRepository.get().getWorkspaceByBucket(s3Uri.getBucket)
I don't believe this logic is best placed here. CreoS3Utils has helpers like getCreoS3Client, so if we solve the problem here in MultiClientRangeReader we'd have to solve it in other places as well. Ideally there is one factory for creating S3 clients. There are, however, multiple problems with getCreoS3Client and its consumption:
- It assumes the region is known or defaults to 'RegionOne':
  a. A hard-coded default does not make sense, in my opinion.
  b. The placeholder value would cause the endpoint to be resolved via the environment variable SWIFT_URL, but it would not change the region. If SigV4 checking is done strictly, that would lead to authorization failures.
- On the consumption side, it is often called within the same file without specifying an argument.
I like the idea of a workspace repository. Python could provision the info for the workspaces that are potentially encountered.
So a ClientFactory method that just takes a bucket name (or an S3URI from which it can extract the bucket name), uses the workspace repository to resolve the bucket to S3 details (region + endpoint, or region + profile), and potentially falls back to the legacy resolution would be usable in both places.
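To make the suggestion concrete, here is a rough sketch of the resolution step only. All names (S3Details, BucketRegistry, S3DetailsResolver) are made up for illustration and are not taken from the PR, and the legacy resolution is passed in as a plain function rather than calling getCreoS3Client. Building the actual S3 client from the resolved details is left out.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical value type: the S3 details a bucket resolves to.
final case class S3Details(region: String, endpoint: Option[String], profile: Option[String])

// Hypothetical stand-in for the workspace repository discussed in this PR.
final class BucketRegistry {
  private val detailsByBucket = TrieMap.empty[String, S3Details]

  def register(bucket: String, details: S3Details): Unit =
    detailsByBucket.update(bucket, details)

  def lookup(bucket: String): Option[S3Details] = detailsByBucket.get(bucket)
}

object S3DetailsResolver {
  // Prefer what the registry knows about the bucket; only when the bucket
  // is unknown does the legacy resolution (passed in by the caller) run.
  def resolve(
      bucket: String,
      registry: BucketRegistry,
      legacy: String => Option[S3Details]
  ): Option[S3Details] =
    registry.lookup(bucket).orElse(legacy(bucket))
}
```

Both MultiClientRangeReader and the CreoS3Utils helpers could then go through the same resolve call, so the region never has to be guessed at the call site.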
*/

private val workspaces = scala.collection.mutable.Map[String, WorkspaceConfig]()
private val workspacesByBucket = scala.collection.mutable.Map[String, WorkspaceConfig]()
Is there a particular reason not to make workspacesByBucket a Map[String, String] from bucket name to workspace ID? If there is a 1-to-1 relationship between workspace and bucket name, a bucket can be resolved with two lookups and the WorkspaceConfig won't be duplicated.
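A minimal sketch of what I mean, assuming a 1-to-1 relationship between workspace and bucket. The WorkspaceConfig fields and the WorkspaceIndex name are hypothetical, since the real class is not fully visible in the diff:

```scala
import scala.collection.mutable

// Hypothetical stand-in; the real WorkspaceConfig may carry other fields.
final case class WorkspaceConfig(workspaceId: String, bucketName: String, region: String)

final class WorkspaceIndex {
  // The config is stored once, keyed by workspace ID.
  private val workspaces = mutable.Map[String, WorkspaceConfig]()
  // The bucket index only stores the workspace ID, not a second config reference.
  private val workspaceIdByBucket = mutable.Map[String, String]()

  def register(config: WorkspaceConfig): Unit = {
    workspaces(config.workspaceId) = config
    workspaceIdByBucket(config.bucketName) = config.workspaceId
  }

  def getWorkspace(workspaceId: String): Option[WorkspaceConfig] =
    workspaces.get(workspaceId)

  // Two lookups: bucket name -> workspace ID -> config.
  def getWorkspaceByBucket(bucketName: String): Option[WorkspaceConfig] =
    workspaceIdByBucket.get(bucketName).flatMap(workspaces.get)
}
```

With this layout, re-registering a workspace updates a single entry, and the bucket lookup cannot return a stale copy of the config.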
private val workspaces = scala.collection.mutable.Map[String, WorkspaceConfig]()
private val workspacesByBucket = scala.collection.mutable.Map[String, WorkspaceConfig]()

def registerBucketDetails( workspaceId:String, bucketName: String,
What will call these register functions? I guess the singleton is per JVM, so they need to be called on the driver and all executors?
For K8s we can easily make sure a file is present in the execution environment. Is the same true for YARN? Otherwise the loading can be done on initialization of the singleton.
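Loading on initialization could look roughly like the sketch below. It assumes a plain-text file with one `workspaceId,bucketName,region` entry per line and an environment variable WORKSPACE_CONFIG_FILE pointing at it. Both the file format and the variable name are invented for illustration, as are WorkspaceEntry and WorkspaceFileLoader.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.io.Source

// Hypothetical entry type for one line of the (assumed) workspace file.
final case class WorkspaceEntry(workspaceId: String, bucketName: String, region: String)

object WorkspaceFileLoader {
  // Parses "workspaceId,bucketName,region" lines into a map keyed by bucket name.
  // Blank lines, '#' comments, and malformed lines are skipped.
  def parse(lines: Seq[String]): Map[String, WorkspaceEntry] =
    lines
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .flatMap { line =>
        line.split(",").map(_.trim) match {
          case Array(id, bucket, region) => Some(WorkspaceEntry(id, bucket, region))
          case _                         => None
        }
      }
      .map(entry => (entry.bucketName, entry))
      .toMap

  // Returns an empty map when the file is absent or unreadable.
  def load(path: Path): Map[String, WorkspaceEntry] =
    if (Files.isReadable(path)) {
      val source = Source.fromFile(path.toFile, "UTF-8")
      try parse(source.getLines().toList)
      finally source.close()
    } else Map.empty

  // Evaluated at most once per JVM, on first access, so the driver and each
  // executor load the file themselves without an explicit register call.
  lazy val byBucket: Map[String, WorkspaceEntry] =
    sys.env.get("WORKSPACE_CONFIG_FILE").map(p => load(Paths.get(p))).getOrElse(Map.empty)
}
```

This only works if the file is actually present on every node, which is the open question for YARN.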
Are sync jobs a thing on YARN? I ask because for jobs, I guess extra files can be handed over at spark-submit time.
#441