Support functionalities to add user-provided, original labels. by tyzerrr · Pull Request #446 · m3dev/gokart

tyzerrr · 2025-03-03T08:00:47Z

Related works

Once this PR is merged, gokart can attach a task's parameter information to the GCS cache.

What does this PR do?

This PR adds support for user-provided, original labels when saving a task output to GCS.

Why is this needed?

gokart users can store data such as logs, results, and parameter information as cache in Google Cloud Storage (GCS).
Currently, gokart attaches labels generated from each task's parameter information when saving files to GCS.
However, as the number of cached objects grows, users sometimes want to add their own labels to the cache to make filtering easier.
This PR adds support for that.

Checklist

CI is passing
Code formatting follows project standards.
Necessary tests have been added.
Existing tests pass.

kitagry · 2025-03-03T08:43:38Z

+    def dump(self, obj: Any, target: Union[str, TargetOnKart], user_provided_labels: Optional[dict[Any, Any]] = None) -> None: ...

-    def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None:
+    def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None:


Maybe the name user_provided_labels is confusing. When users call task.dump, they can't predict what the labels will be.

I have some candidates for this variable name.
I'd go for custom_labels.

kitagry · 2025-03-03T08:44:10Z

+    def dump(self, obj: Any, target: Union[str, TargetOnKart], user_provided_labels: Optional[dict[Any, Any]] = None) -> None: ...

-    def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None:
+    def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None:


I think the label keys should be strings.

kitagry · 2025-03-03T08:48:05Z

+        return dict(metadata) | dict(labels)
+
+    @staticmethod
+    def _add_labels_to_metadata(labels_dict, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys):


Could you add types for the args?

adieumonks · 2025-03-03T09:25:59Z

+    @staticmethod
+    def _normalize_labels(task_params: Optional[dict[Any, str]], user_provided_labels: Optional[dict[Any, Any]]) -> tuple[dict[Any, str], dict[Any, str]]:
+        def _normalize_labels_helper(_params: Optional[dict[Any, Any]]) -> dict[Any, str]:
+            return {str(key): str(value) for key, value in _params.items()} if _params else {}
+
+        return (
+            _normalize_labels_helper(task_params),
+            _normalize_labels_helper(user_provided_labels),
+        )
+


If _normalize_labels does the same thing for both task_params and user_provided_labels, it would be better to have a single argument (e.g., labels) and call it twice.

adieumonks · 2025-03-03T09:30:48Z

+        total_metadata_size, labels = GCSObjectMetadataClient._add_labels_to_metadata(
+            normalized_user_provided_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
+        )
+        _, labels = GCSObjectMetadataClient._add_labels_to_metadata(
+            normalized_task_params_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
+        )


I think GCSObjectMetadataClient only needs to focus on adding metadata to the cache and doesn't necessarily need to be aware of the differences between task_params and user_provided_labels.

@adieumonks Thanks for your comment. I understand that your point is that GCSObjectMetadataClient shouldn't need to distinguish between user-provided labels and labels generated from parameters. I agree with that, and one possible approach could be to merge both sets of labels beforehand as a preprocessing step. However, as mentioned in the comment, there is a possibility that the keys of user-provided labels and parameter-generated labels could conflict. Based on the use case, I want to prioritize the values of user-provided labels. If we implement this in a straightforward way, we would end up with multiple for loops performing similar processing. To avoid that redundancy, I extracted the logic into the _add_labels_to_metadata function, which led to the current implementation.

Would it be a bad idea to overwrite task_params with user_provided_labels before calling the function, like GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels)?

I wasn't familiar with GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels), so I looked it up. Since Python 3.9, the | operator merges dictionaries, and the right-hand side takes priority.
However, I want to log key conflicts, because if a conflict is captured in the log, the user can notice it. If we adopt your suggestion, that becomes hard to log.
What do you think?

Of course, we could count the total number of keys before merging; if the count changes after merging the two dictionaries, a key conflict must have occurred, so we could log it.
But detecting exactly which keys conflict would require writing some otherwise meaningless code.
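One possible way to reconcile the two positions is sketched below, assuming both label dictionaries are already normalized to string keys and values. The function name merge_labels_with_conflict_log is hypothetical and is not part of gokart; the sketch only illustrates that the conflicting keys can be found with a set intersection before the | merge.

```python
import logging

logger = logging.getLogger(__name__)


def merge_labels_with_conflict_log(task_params: dict[str, str], custom_labels: dict[str, str]) -> dict[str, str]:
    """Hypothetical helper: merge two label dicts and log every conflicting key."""
    # dict.keys() returns a set-like view, so `&` yields the keys present in both.
    conflicting_keys = task_params.keys() & custom_labels.keys()
    for key in sorted(conflicting_keys):
        logger.warning('label key %r conflicts; the custom label value is used.', key)
    # Since Python 3.9, `|` merges dicts and the right-hand side wins on conflicts.
    return task_params | custom_labels


merged = merge_labels_with_conflict_log({'a': '1', 'b': '2'}, {'b': 'custom', 'c': '3'})
```

Here only the key 'b' is logged, and the merged dictionary keeps the custom value for it.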

hiro-o918 · 2025-03-04T04:22:06Z

+    @staticmethod
+    def _is_log_related_path(path: str) -> bool:
+        return (
+            ('log/random_seed' in path)


How about this?

Suggested change

-            ('log/random_seed' in path)
+        return path.startswith('log/')

In my opinion, your suggestion is not acceptable.
If a user changes the default dump location from main to log/, no labels would be added.
The reason I wrote the implementation this way is that the log/{hogehoge} paths are hard-coded in the gokart implementation, so these paths are fixed.

@TlexCypher
OK!
But I can't see why these particular subpaths are specified; please add a comment explaining the reason.

Also, imo, a regex such as r'^log/(processing_time|...).+\.txt$' is more suitable for this case.
The current implementation also returns True for paths like foo/log/processing_time.
Is this correct?

Using regex is better, as you said.
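The difference between the substring check and an anchored regex can be sketched as follows. This is only an illustration: the pattern lists just the two subpaths named in this thread (processing_time and random_seed), the remaining alternatives are elided in the review comment and are not guessed here, and the function names and example file paths are hypothetical.

```python
import re

# Anchored at the start of the path; only the two subpaths mentioned in the
# thread are listed (assumption: the real list in gokart is longer).
LOG_PATH_PATTERN = re.compile(r'^log/(processing_time|random_seed)/')


def is_log_related_path_by_substring(path: str) -> bool:
    """Substring check: also matches when 'log/...' appears in the middle of the path."""
    return 'log/processing_time' in path or 'log/random_seed' in path


def is_log_related_path_by_regex(path: str) -> bool:
    """Anchored check: matches only when the path starts with log/<known subpath>/."""
    return LOG_PATH_PATTERN.match(path) is not None


nested = 'foo/log/processing_time/task.txt'  # hypothetical path
top_level = 'log/processing_time/task.txt'   # hypothetical path
```

With these definitions the substring check accepts both paths, while the anchored regex accepts only the top-level one.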

hiro-o918 · 2025-03-04T05:57:29Z

+        return dict(metadata) | dict(labels)
+
+    @staticmethod
+    def _add_labels_to_metadata(


This function name is a little ambiguous.
What does metadata mean here?

How about the following pattern?
Please let me know your thoughts.

from dataclasses import dataclass, field

@dataclass
class LabelMetadata:
    labels: dict[str, str] = field(default_factory=dict)
    label_keys: set[str] = field(default_factory=set)
    total_metadata_size: int = 0

Suggested change

-    def _add_labels_to_metadata(
+    def _create_or_append_metadata(
+        labels: dict[str, str],
+        max_gcs_metadata_size,
+        current_metadata: Optional[LabelMetadata] = None,
+    ) -> LabelMetadata

I think creating a data class just for this is redundant.
It would be better to simply rename the method.

hiro-o918

LGTM!

yokomotod · 2025-03-05T01:37:59Z

                logger.error(f'failed to patch object {obj} in bucket {bucket} and object {obj}.')

+    @staticmethod
+    def _normalize_labels(labels: Optional[dict[str, Any]]) -> dict[str, str]:


[nits]

You can write dict[str, Any] | None since Python 3.10, or since 3.7 with from __future__ import annotations.

Thank you for the educational and helpful comment!
I didn't know that Optional is the older notation.

kitagry

I added nits comment. LGTM

kitagry · 2025-03-05T01:45:32Z

+        total_metadata_size, labels, has_seen_keys = GCSObjectMetadataClient._add_labels_with_size_limitation(
+            normalized_custom_labels, total_metadata_size, max_gcs_metadata_size
+        )
+        _, labels, _ = GCSObjectMetadataClient._add_labels_with_size_limitation(
+            normalized_task_params_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
+        )


I think these methods are a bit confusing. I think we can separate this into two phases:

1. merge normalized_task_params_labels and normalized_custom_labels
2. delete oversized labels

I think a method shouldn't do multiple things.

Hmm, this is a thoughtful comment.
Basically, I agree with you. Separation of responsibilities is one of the core coding principles.
But in this case, we need to add custom labels before parameter-based labels.
I checked the dictionary specification: since Python 3.7, dictionaries preserve insertion order.
So even if we adopt your suggested separation, we can still achieve this easily, which is great.
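The two phases could be sketched roughly as follows. This is not the merged gokart code: the function names, the 8 KiB constant, the byte-size accounting (UTF-8 length of key plus value), and the choice to skip labels that do not fit are all assumptions made for illustration.

```python
MAX_GCS_METADATA_SIZE = 8 * 1024  # assumed limit in bytes (8 KiB)


def merge_labels(custom_labels: dict[str, str], task_params_labels: dict[str, str]) -> dict[str, str]:
    """Phase 1: custom labels are inserted first and win on key conflicts."""
    merged = dict(custom_labels)
    for key, value in task_params_labels.items():
        # setdefault keeps the existing (custom) value when the key already exists.
        merged.setdefault(key, value)
    return merged


def drop_oversized_labels(labels: dict[str, str], max_size: int = MAX_GCS_METADATA_SIZE) -> dict[str, str]:
    """Phase 2: walk labels in insertion order and keep those that still fit."""
    kept: dict[str, str] = {}
    total = 0
    for key, value in labels.items():
        size = len(key.encode('utf-8')) + len(value.encode('utf-8'))
        if total + size > max_size:
            continue  # skip a label that would push the total over the limit
        kept[key] = value
        total += size
    return kept


merged = merge_labels({'owner': 'alice'}, {'owner': 'default', 'seed': '42'})
trimmed = drop_oversized_labels(merged, max_size=12)
```

Because dictionaries preserve insertion order, the custom labels are visited first in phase 2 and are therefore the last to be dropped when the size budget runs out.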

hirosassa

LGTM!

kitagry · 2025-03-05T04:26:25Z

-        )
-        return dict(metadata) | dict(labels)
+        _merged_labels = GCSObjectMetadataClient._merge_custom_labels_and_task_params_labels(normalized_task_params_labels, normalized_custom_labels)
+        return dict(metadata) | dict(GCSObjectMetadataClient._adjust_gcs_metadata_limit_size(_merged_labels))


If GCSObjectMetadataClient._adjust_gcs_metadata_limit_size(_merged_labels) returns 7.9 KiB of labels and metadata already holds 0.2 KiB, the total will exceed 8 KiB. Is that OK?

Ahh... sorry, I'll try to fix this in another contribution.
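The concern can be illustrated with a small sketch in which the size of the existing metadata is subtracted from the budget before any label is added. The names below are hypothetical, the size accounting is an assumption, and the sketch assumes that the label keys do not overlap with the existing metadata keys; it is not the fix that was later merged.

```python
def size_in_bytes(items: dict[str, str]) -> int:
    """Assumed size accounting: UTF-8 byte length of every key plus every value."""
    return sum(len(k.encode('utf-8')) + len(v.encode('utf-8')) for k, v in items.items())


def add_labels_within_limit(metadata: dict[str, str], labels: dict[str, str], max_size: int = 8 * 1024) -> dict[str, str]:
    """Add labels to metadata so that the combined size stays within max_size."""
    # The existing metadata already consumes part of the budget.
    remaining = max_size - size_in_bytes(metadata)
    kept: dict[str, str] = {}
    for key, value in labels.items():
        size = len(key.encode('utf-8')) + len(value.encode('utf-8'))
        if size > remaining:
            continue  # this label does not fit in what is left of the budget
        kept[key] = value
        remaining -= size
    return metadata | kept


result = add_labels_within_limit({'m': 'x'}, {'a': '1', 'b': '22'}, max_size=4)
```

In this toy call the existing metadata uses 2 of the 4 available bytes, so only the first label fits.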

荒木太一 added 17 commits February 27, 2025 12:58

WIP: implement GCSObjectMetadataClient class to attach custom-metadata to gcs-object.

0aaddca

WIP: change dump method interface.

dce1a73

WIP: add user_provided_gcs_labels parameter to TaskOnKart.

6b0dc4c

Test: add test of GCSObjectMetadataClient.

ed86bcb

feat: dealed with nits PR comments.

38fa924

feat: Remove user_provided_labels feature. This feature will be supported another PR.

fcfdd56

for-PR: apply almost all comments.

de1fb45

fix: change Dict to dict.

c11abea

feat: add gokart specific parameter serialize test.

e471163

fix: fix testcases with literals and more meaningful assertion.

47195b9

feat: add mock testcase.

c8931f2

fix: fix CI errors.

cc320b5

feat: deal with pr comments, and modify testcases.

d8c6ec5

feat: deal with kitagry comments.

ca6d69e

feat: Supportfunctionalities to add user specific original labels and namespace support feature and test it.

08549a7

feat: remove namespace feature.

a73801f

feat: resolve conflicts.

aa33064

kitagry requested changes Mar 3, 2025

荒木太一 added 4 commits March 3, 2025 18:25

feat: Deal with kitagry PR comments. Added type annotations.

a047f6a

CI: apply ruff

2d4322c

feat: rename user_provided labels to custom_labels.

6e80c2f

CI: apply ruff

f62742e

tyzerrr requested a review from kitagry March 3, 2025 09:29

adieumonks reviewed Mar 3, 2025

feat: Deal with PR comments. Change _normalize_labels signature and apply changes to test.

91b8282

tyzerrr requested a review from adieumonks March 4, 2025 01:31

hiro-o918 requested changes Mar 4, 2025

荒木太一 added 3 commits March 4, 2025 13:56

feat: deal with PR comments.

6364394

feat: deal with PR comments.

d7cfcdc

CI: fix tests

13dc761

tyzerrr requested a review from hiro-o918 March 4, 2025 05:28

hiro-o918 reviewed Mar 4, 2025

feat: deal with PR comments.

a4d6a0b

tyzerrr requested a review from hiro-o918 March 5, 2025 00:28

CI: apply ruff

3315be5

hiro-o918 approved these changes Mar 5, 2025

yokomotod reviewed Mar 5, 2025

kitagry approved these changes Mar 5, 2025

荒木太一 added 2 commits March 5, 2025 11:30

feat: deal with kitagry PR comments, responsibility separation.

fc4b1fb

feat: deal with yokomotod PR comments, use hoge | None expression instead of Optional.

337fab8

tyzerrr requested a review from yokomotod March 5, 2025 02:46

hirosassa approved these changes Mar 5, 2025

yokomotod approved these changes Mar 5, 2025

kitagry reviewed Mar 5, 2025

inakam merged commit 77d6a4a into m3dev:master Mar 5, 2025
8 checks passed

This was referenced Mar 5, 2025

Fix a small bug for GCS object-custom-metadata size limitation. #448

Merged

Support functionalities to enhance task traceability with metadata for dependency search. #450

Merged

tyzerrr mentioned this pull request Apr 17, 2025

Enhance tracability of cache object stored in Amazon S3 as well as GCS #463

Open
