Support functionalities to enhance task traceability with metadata for dependency search. #450
Conversation
| K = TypeVar('K') | ||
| def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: |
There was a problem hiding this comment.
If we could use generics and isinstance together, the code would look like this:
def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]:
    if isinstance(items, dict):
        return {k: map_flattenable_items(v, func) for k, v in items.items()}
    if isinstance(items, str):
        return items
    # Not valid today: isinstance() rejects subscripted generics such as Iterable[T].
    if isinstance(items, Iterable[T]):
        return [map_flattenable_items(i, func) for i in items]
    return func(items)
| @dataclass | ||
| class _RequiredTaskOutput: | ||
| task_name: str | ||
| output_path: str | ||
| _required_task_outputs = map_flattenable_items( | ||
| self.requires(), | ||
| func=lambda task: map_flattenable_items( | ||
| task.output(), func=lambda output: _RequiredTaskOutput(task_name=task.get_task_family(), output_path=output.path()) | ||
| ), | ||
| ) | ||
| required_task_outputs: dict[str, str] | None = None | ||
| if isinstance(_required_task_outputs, list): | ||
| required_task_outputs = {r.task_name: r.output_path for r in _required_task_outputs} | ||
| elif isinstance(_required_task_outputs, dict): | ||
| required_task_outputs = _required_task_outputs | ||
| else: | ||
| required_task_outputs = ( | ||
| {_required_task_outputs.task_name: _required_task_outputs.output_path} if isinstance(_required_task_outputs, _RequiredTaskOutput) else None | ||
| ) |
There was a problem hiding this comment.
[imo]
Extracting this section into a method that returns required_task_outputs would make it more readable.
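A minimal sketch of that extraction, assuming the mapped result is either a single record, a flat list of records, or a dict, as in the branches above. The names RequiredTaskOutput and collect_required_task_outputs are hypothetical stand-ins, not the names used in the PR:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequiredTaskOutput:
    # Hypothetical stand-in for the private dataclass in the diff above.
    task_name: str
    output_path: str


def collect_required_task_outputs(mapped: Any) -> Optional[dict[str, str]]:
    """Normalize a mapped requires() result into {task_name: output_path}."""
    if isinstance(mapped, RequiredTaskOutput):
        return {mapped.task_name: mapped.output_path}
    if isinstance(mapped, list):
        # Assumes a flat list of records; nested lists are not handled here.
        return {r.task_name: r.output_path for r in mapped}
    if isinstance(mapped, dict):
        return mapped
    return None
```

With a helper like this, the caller shrinks to a single assignment and the branching can be tested on its own.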
| lock_at_dump: bool = True, | ||
| task_params: dict[str, str] | None = None, | ||
| custom_labels: dict[str, Any] | None = None, | ||
| required_task_outputs: dict[str, str] | None = None, |
There was a problem hiding this comment.
[imo]
This parameter seems to be only metadata, but its name suggests that it affects the functionality of the method or the class's attributes. It would be better to rename it to avoid that confusion.
|
LGTM |
|
@mamo3gr Thank you for your thoughtful comments. I'm gonna deal with all of them. |
| if isinstance(items, str): | ||
| return items # type: ignore |
There was a problem hiding this comment.
In this case, T is `str`, so you should apply func to it.
| if isinstance(items, str): | |
| return items # type: ignore | |
| if isinstance(items, str): | |
| return func(items) # type: ignore |
| K = TypeVar('K') | ||
| def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: |
There was a problem hiding this comment.
https://docs.python.org/3.13/library/functions.html#map
Python's built-in map is defined as map(function, iterable), so this function should follow the same argument order.
| def map_flattenable_items(items: FlattenableItems[T], func: Callable[[T], K]) -> FlattenableItems[K]: | |
| def map_flattenable_items(func: Callable[[T], K], items: FlattenableItems[T]) -> FlattenableItems[K]: |
| if isinstance(items, str): | ||
| return items # type: ignore | ||
| if isinstance(items, Iterable): | ||
| return [map_flattenable_items(i, func) for i in items] |
There was a problem hiding this comment.
When a tuple[T] is passed, this should return a tuple[K], but this implementation does not handle that case.
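Taken together, the review points on this function (apply func to str leaves, follow map's argument order, preserve tuples) could be sketched as below. This is an illustration only: Any stands in for gokart's FlattenableItems type, and the real implementation may differ.

```python
from collections.abc import Iterable
from typing import Any, Callable


def map_flattenable_items(func: Callable[[Any], Any], items: Any) -> Any:
    """Apply func to every leaf of a nested dict/tuple/list structure."""
    if isinstance(items, dict):
        return {key: map_flattenable_items(func, value) for key, value in items.items()}
    if isinstance(items, tuple):
        # Preserve tuples instead of silently turning them into lists.
        return tuple(map_flattenable_items(func, item) for item in items)
    if isinstance(items, str):
        # A str is iterable, but here it is treated as a leaf value.
        return func(items)
    if isinstance(items, Iterable):
        return [map_flattenable_items(func, item) for item in items]
    return func(items)
```

The tuple check comes before the generic Iterable check because a tuple would otherwise match the latter and come back as a list.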
| continue | ||
| merged_labels[label_name] = label_value | ||
| merged_labels: dict[str, str] = {} | ||
| for normalized_label in normalized_labels_list[:]: |
There was a problem hiding this comment.
| for normalized_label in normalized_labels_list[:]: | |
| for normalized_label in normalized_labels_list: |
…uiredTaskOutput]]
|
@kitagry Sorry for late actions. |
| from gokart.required_task_output import RequiredTaskOutput | ||
| from gokart.utils import map_flattenable_items | ||
| if sys.version_info < (3, 13): |
There was a problem hiding this comment.
Maybe this part is not needed?
| local_temporary_directory=./resource/tmp | ||
| [core] | ||
| logging_conf_file=logging.ini No newline at end of file |
There was a problem hiding this comment.
[nits]
Add a newline at the end of the file.
| copy.deepcopy(original_metadata), | ||
| task_params, | ||
| custom_labels, | ||
| required_task_outputs if required_task_outputs else None, |
There was a problem hiding this comment.
It seems to be redundant
| required_task_outputs if required_task_outputs else None, | |
| required_task_outputs, |
| # Instead, users are expected to search using the labels they provided. | ||
| # Therefore, in the event of a key conflict, the value registered by the user-provided labels will take precedence. | ||
| _merged_labels = GCSObjectMetadataClient._merge_custom_labels_and_task_params_labels(normalized_task_params_labels, normalized_custom_labels) | ||
| normalized_labels = ( |
There was a problem hiding this comment.
[imo]
I prefer this for readability:
| normalized_labels = ( | |
| normalized_labels = [normalized_custom_labels, normalized_task_params_labels] | |
| if required_task_outputs: | |
|     normalized_labels.append({'__required_task_outputs': json.dumps(GCSObjectMetadataClient._get_serialized_string(required_task_outputs))}) |
| merged_labels: dict[str, str] = {} | ||
| for normalized_label in normalized_labels_list[:]: | ||
| for label_name, label_value in normalized_label.items(): | ||
| if len(label_value) == 0: |
There was a problem hiding this comment.
[MUST] This code may fail because it assumes that label_value is a str.
I would check that it is a str before checking its length, for example:
isinstance(label_value, str) and len(label_value) == 0
There was a problem hiding this comment.
Thank you for reviewing my code!
In my opinion, the type check is not necessary, because GCSObjectMetadataClient._normalize_labels converts every value stored in the dictionary into a string.
So label_value is always a string.
There was a problem hiding this comment.
@TlexCypher
Then maybe the parameter annotation normalized_labels_list: list[dict[str, Any]] should be normalized_labels_list: list[dict[str, str]]?
There was a problem hiding this comment.
@TlexCypher Could you check this comment?
If you are sure that label_value is a str, you should use str instead of Any.
| continue | ||
| merged_labels[label_name] = label_value | ||
| merged_labels: dict[str, str] = {} | ||
| for normalized_label in normalized_labels_list[:]: |
There was a problem hiding this comment.
[weak-IMO]
for normalized_label in normalized_labels_list:
    for label_name, label_value in normalized_label.items():
        if len(label_value) == 0:
I found this part a bit difficult to understand because it is deeply nested.
It might improve if you extract the for label_name, label_value in ... part into a separate function and apply it with functools.reduce().
That said, the current code is OK. :)
There was a problem hiding this comment.
Thank you for the great suggestion!
For this specific task of merging labels, the simple nested loop is likely more readable and Pythonic than functools.reduce.
reduce would work, but here the straightforward nested loop (or perhaps the alternative 'flattening' approach) probably offers better clarity and maintainability.
What do you think?
There was a problem hiding this comment.
I prefer the reduce approach because it states the intent of building merged_labels up front, which makes the code easier to understand for a first-time reader.
merged_labels = reduce(...)
With the nested loop, you need to read down to L.147 to understand why merged_labels is being built.
However, both approaches are OK, since this is a relatively small loop nest. :)
There was a problem hiding this comment.
You're right.
I've changed the code to use functools.reduce.
Thank you for your help!
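For illustration, a reduce-based merge might look like the sketch below. It assumes that empty values are skipped and that dictionaries earlier in the list win on key conflicts; _merge_two and merge_labels are hypothetical names, and this is not the code that was actually pushed.

```python
from functools import reduce


def _merge_two(merged: dict[str, str], labels: dict[str, str]) -> dict[str, str]:
    """Fold one label dict into the accumulator without overwriting existing keys."""
    kept = {name: value for name, value in labels.items() if value and name not in merged}
    return {**merged, **kept}


def merge_labels(normalized_labels_list: list[dict[str, str]]) -> dict[str, str]:
    # The intent (build one merged dict) is visible in a single expression.
    return reduce(_merge_two, normalized_labels_list, {})


custom_labels = {'owner': 'alice', 'note': ''}
task_params_labels = {'owner': 'bob', 'seed': '42'}
merged_labels = merge_labels([custom_labels, task_params_labels])
```

Annotating the input as list[dict[str, str]] also records the guarantee discussed earlier in this thread, that every label value is already a string.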
mski-iksm
left a comment
There was a problem hiding this comment.
@TlexCypher
I've made some comments but mainly LGTM! Thank you for your contribution!
|
@kitagry Hey, could you check this PR again? |
kitagry
left a comment
There was a problem hiding this comment.
Sorry for the late review. Could you change the custom_label type to dict[str, str]?
|
@kitagry Sorry for late reply. |
Related works
#445
#446
#448
What does this PR do?
This pull request implements a metadata attribution feature that makes it possible to search for tasks that depend on a specific task executed with a given parameter set.
Why is this needed?
Gokart caches the execution results and parameter states of each task in GCS. As shown in the Related Works section, various metadata are attached to each GCS object to enhance traceability. A common use case is searching for tasks that depend on a specific task executed with a given parameter set. Currently, Gokart does not support searching and tracing task dependencies from GCS metadata. This PR introduces this functionality.
Prerequisites
The focus of this PR is embedding the necessary metadata to allow the CLI to search for specific dependencies. The search functionality itself will be implemented on the CLI side (CLI: https://github.com/TlexCypher/gcs-metadog).
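To illustrate how such metadata could be consumed, here is a rough sketch of a client-side search. It is not how gcs-metadog works; it assumes each object's metadata is available as a plain dict and that the __required_task_outputs value is a JSON object mapping task names to output paths. find_dependents is a hypothetical helper.

```python
import json


def find_dependents(objects_metadata: dict[str, dict[str, str]], task_name: str) -> list[str]:
    """Return the names of objects whose recorded dependencies include task_name."""
    dependents = []
    for object_name, metadata in objects_metadata.items():
        raw = metadata.get('__required_task_outputs')
        if raw is None:
            # Objects written without dependency metadata are skipped.
            continue
        required = json.loads(raw)  # assumed shape: {task_name: output_path}
        if task_name in required:
            dependents.append(object_name)
    return dependents


example_metadata = {
    'b.pkl': {'__required_task_outputs': json.dumps({'TaskA': 'a.pkl'})},
    'c.pkl': {},
}
```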
Checklist
CI is passing
Code formatting follows project standards.
Necessary tests have been added.
Existing tests pass.