Skip to content

Support functionalities to add user-provided, original labels.#446

Merged
inakam merged 29 commits intom3dev:masterfrom
tyzerrr:feat/gcs-user-provided-label
Mar 5, 2025
Merged

Support functionalities to add user-provided, original labels.#446
inakam merged 29 commits intom3dev:masterfrom
tyzerrr:feat/gcs-user-provided-label

Conversation

@tyzerrr
Copy link
Copy Markdown
Contributor

@tyzerrr tyzerrr commented Mar 3, 2025

Related works

#445

By merging this PR, gokart can add task's parameter information to gcs cache.

What does this PR do?

This PR adds support for user-provided, original labels when saving a task output to GCS.

Why is this needed?

Users of gokart can store some data like logs, results, parameter information etc. as cache in Google Cloud Storage(GCS).
For now, gokart can provide labels, generated from each task parameter information when saving files to GCS.
But, when getting number of cached objects growing, in some cases, user want to add original labels to cache for more ease filtering.
So in this PR, support those kind way.

Checklist

  • CI is passing
  • Code formatting follows project standards.
  • Necessary tests have been added.
  • Existing tests pass.

Comment thread gokart/task.py Outdated
def dump(self, obj: Any, target: Union[str, TargetOnKart], user_provided_labels: Optional[dict[Any, Any]] = None) -> None: ...

def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None:
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe, user_provided_labels name is confusing. when user call task.dump, user can't predict what the labels will be.

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some candidates for this variable.
I go for custom_labels.

Comment thread gokart/task.py Outdated
def dump(self, obj: Any, target: Union[str, TargetOnKart], user_provided_labels: Optional[dict[Any, Any]] = None) -> None: ...

def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None) -> None:
def dump(self, obj: Any, target: Union[None, str, TargetOnKart] = None, user_provided_labels: Optional[dict[Any, Any]] = None) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think label's key should be string.

Comment thread gokart/gcs_obj_metadata_client.py Outdated
return dict(metadata) | dict(labels)

@staticmethod
def _add_labels_to_metadata(labels_dict, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add type for args?

@tyzerrr tyzerrr requested a review from kitagry March 3, 2025 09:29
Comment on lines +80 to +89
@staticmethod
def _normalize_labels(task_params: Optional[dict[Any, str]], user_provided_labels: Optional[dict[Any, Any]]) -> tuple[dict[Any, str], dict[Any, str]]:
def _normalize_labels_helper(_params: Optional[dict[Any, Any]]) -> dict[Any, str]:
return {str(key): str(value) for key, value in _params.items()} if _params else {}

return (
_normalize_labels_helper(task_params),
_normalize_labels_helper(user_provided_labels),
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If _normalize_labels does the same thing for both task_params and user_provided_labels, it would be better to have a single argument (e.g., labels) and call it twice.

Comment thread gokart/gcs_obj_metadata_client.py Outdated
Comment on lines +115 to +120
total_metadata_size, labels = GCSObjectMetadataClient._add_labels_to_metadata(
normalized_user_provided_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
)
_, labels = GCSObjectMetadataClient._add_labels_to_metadata(
normalized_task_params_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GCSObjectMetadataClient only needs to focus on adding metadata to the cache and doesn't necessarily need to be aware of the differences between task_params and user_provided_labels

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adieumonks Thanks for your comment. I understand that your point is that GCSObjectMetadataClient shouldn't need to distinguish between user-provided labels and labels generated from parameters. I agree with that, and one possible approach could be to merge both sets of labels beforehand as a preprocessing step. However, as mentioned in the comment, there is a possibility that the keys of user-provided labels and parameter-generated labels could conflict. Based on the use case, I want to prioritize the values of user-provided labels. If we implement this in a straightforward way, we would end up with multiple for loops performing similar processing. To avoid that redundancy, I extracted the logic into the _add_labels_to_metadata function, which led to the current implementation.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a bad idea to overwrite task_params with user_provided_labels before calling the function, like GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels)?

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what GCSObjectMetadataClient.add_task_state_labels(task_params | user_provided_labels) means, so I search it. From Python3.9, | operator merges dictionary, and right side is prioritized.
Then, I wanna log key conflict event, because if event is captured as log, user can know key conflicts, but if we adopt your suggestion, hard to log it.
What do you think?

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off course, we number total keys, then after merging two dictionary, if the number of total key were changed, key conflicts might be happened, so we can log it.
But, some meaningless code would be written, to detect what exact keys cause key conflicts.

@tyzerrr tyzerrr requested a review from adieumonks March 4, 2025 01:31
Comment thread gokart/gcs_obj_metadata_client.py Outdated
Comment thread gokart/gcs_obj_metadata_client.py Outdated
Comment thread gokart/gcs_obj_metadata_client.py Outdated
Comment thread test/test_gcs_obj_metadata_client.py Outdated
Comment thread test/test_gcs_obj_metadata_client.py Outdated
Comment thread gokart/gcs_obj_metadata_client.py Outdated
@staticmethod
def _is_log_related_path(path: str) -> bool:
return (
('log/random_seed' in path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about this?

Suggested change
('log/random_seed' in path)
return path.startwith('log/')

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, I think your suggeston is not acceptable.
Because if user wants to change default place to dump from main to log/, no labels would be added.
The reason why I write as following implementation, log/{hogehoge} is hard coded in gokart implementation, so these passes are fixed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TlexCypher
OK!
But I cannot find the reason to specify the subpaths, please add comments the reasons.

Also, imo, using regex is more suitable for this case like r'^log/(processing_time|...).+\.txt$?
Current implementation will be true for paths likefoo/log/pprocessing_time.
Is this correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using regex is better, as you said.

@tyzerrr tyzerrr requested a review from hiro-o918 March 4, 2025 05:28
Comment thread gokart/gcs_obj_metadata_client.py Outdated
return dict(metadata) | dict(labels)

@staticmethod
def _add_labels_to_metadata(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is ambiguous a little.
what does metadata mean?

how about the following pattern?
please let me know your thoughts.

@dataclass
class LabelMetadata:
    labels: dict[str, str] = Field(default_factory=dict)
    label_keys: : set[str] = Field(default_factory=set)
    total_metadata_size: int
Suggested change
def _add_labels_to_metadata(
def _create_or_append_metadata(
labels: dict[str, str],
max_gcs_metadata_size,
current_metadata: Optional[LabelMetadata] = None,
) -> LabelMetadata

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think creating a data class just for this is redundant.
It would be better to simply rename the method.

@tyzerrr tyzerrr requested a review from hiro-o918 March 5, 2025 00:28
Copy link
Copy Markdown
Contributor

@hiro-o918 hiro-o918 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment thread gokart/gcs_obj_metadata_client.py Outdated
logger.error(f'failed to patch object {obj} in bucket {bucket} and object {obj}.')

@staticmethod
def _normalize_labels(labels: Optional[dict[str, Any]]) -> dict[str, str]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nits]

you can write dict[str, Any] | None since 3.10, or, 3.7 with from __future__ import annotations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for educational and helpful comment!
I'm not sure Optional expression is old expression.

Copy link
Copy Markdown
Member

@kitagry kitagry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added nits comment. LGTM

Comment thread gokart/gcs_obj_metadata_client.py Outdated
Comment on lines +105 to +110
total_metadata_size, labels, has_seen_keys = GCSObjectMetadataClient._add_labels_with_size_limitation(
normalized_custom_labels, total_metadata_size, max_gcs_metadata_size
)
_, labels, _ = GCSObjectMetadataClient._add_labels_with_size_limitation(
normalized_task_params_labels, total_metadata_size, max_gcs_metadata_size, labels, has_seen_keys
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these methods are a bit confusing. I think we can separate two phase.

  1. merge normalized_task_params_labels and normalized_custom_labels
  2. delete oversized labels

I think you shouldn't multiple things in a method.

Copy link
Copy Markdown
Contributor Author

@tyzerrr tyzerrr Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is thoughtful comment.
Basically, I agree to you. One of the coding principle is responsibility separation.
But in this case, we need to add custom labels before parameter based labels.
I searched dictionary specification, from Python3.7, dictionary preserves the order in which they were added.
Then, even if adopt your separation suggestion, easily to this objective, that's great.

@tyzerrr tyzerrr requested a review from yokomotod March 5, 2025 02:46
Copy link
Copy Markdown
Collaborator

@hirosassa hirosassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

)
return dict(metadata) | dict(labels)
_merged_labels = GCSObjectMetadataClient._merge_custom_labels_and_task_params_labels(normalized_task_params_labels, normalized_custom_labels)
return dict(metadata) | dict(GCSObjectMetadataClient._adjust_gcs_metadata_limit_size(_merged_labels))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When GCSObjectMetadataClient._adjust_gcs_metadata_limit_size(_merged_labels) has 7.9KiB and metadata has 0.2KiB, it will be more than 8KiB. It is ok?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh... sorry, try to recontribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants