Support tuple variable names in _subset_list #148
Chirag3841 wants to merge 4 commits into arviz-devs:main from
src/arviz_base/utils.py
    subset: Hashable | Sequence[Hashable] | None,
    whole_list: Sequence[Hashable],
    filter_items: str | None = None,
    warn: bool = True,
    check_if_present: bool = True,
We only use explicit type hints when we want them to differ from the docstring; otherwise, to keep a single source of truth, we keep that information in the docstring only. (Note that docstub automatically translates the docstring into proper type hints for the respective .pyi files.)
src/arviz_base/utils.py
    if subset is not None:
        if isinstance(subset, str):
            subset = [subset]
        elif isinstance(subset, tuple) and subset in whole_list:
I used tuple in the issue as an example, but the whole point of the issue was to do a deeper investigation into the different potentially valid cases and how we want them to behave. If you restrict to tuple then we aren't really matching xarray's behaviour, see for example:
v = frozenset({"a", "b"})
ds = xr.Dataset({frozenset({"a", "b"}): (("dim",), [1, 2, 3])})
ds[v]
# out
# <xarray.DataArray frozenset({'b', 'a'}) (dim: 3)> Size: 24B
# array([1, 2, 3])
# Dimensions without coordinates: dim

    elif isinstance(subset, Sequence) and not isinstance(subset, str | bytes):
        subset = list(subset)

    real_items = [
        real_item
        for real_item in whole_list
        if isinstance(real_item, str) and pattern in real_item
    ]
I am not sure this is what we want. IIUC, with this behaviour, if I use var_names="~theta", filter_vars="like" and I have as variable names ("theta", "original"), ("theta", "transformed"), and ("tau", "original") I end up plotting/keeping all the variables.
I think for like it would make more sense to exclude the first two. For regex I am much less sure if we want to try and do something complicated or keep things simple and ignore filter_vars completely in case of non-string elements.
Important note: This is a collaborative project and it is quite probable that it will take a while until we all agree on a behaviour around this. I may have ideas, but me saying "I think this or that should happen" doesn't automatically mean this should be the behaviour of the library. It can be frustrating, but you'll probably need some extra patience for this PR.
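To make the `like` case above concrete, here is a minimal sketch of one possible semantics: a tuple name counts as a match when any of its string elements contains the pattern. The helpers `matches_like` and `exclude_like` are hypothetical names used only for illustration; they are not part of arviz_base, and the behaviour they encode is just one of the options under discussion.

```python
def matches_like(name, pattern):
    """Return True if `pattern` is a substring of a string name,
    or of any string element of a tuple name (assumed semantics)."""
    if isinstance(name, str):
        return pattern in name
    if isinstance(name, tuple):
        return any(isinstance(part, str) and pattern in part for part in name)
    # Other hashables (frozenset, int, ...) never match in this sketch.
    return False


def exclude_like(whole_list, pattern):
    """Emulate var_names="~<pattern>" with filter_vars="like": drop every match."""
    return [name for name in whole_list if not matches_like(name, pattern)]


names = [("theta", "original"), ("theta", "transformed"), ("tau", "original")]
kept = exclude_like(names, "theta")
print(kept)  # [('tau', 'original')]
```

Under this reading, both `("theta", ...)` variables are excluded and only `("tau", "original")` is kept. The sketch deliberately leaves `regex` out, since it is still open whether non-string names should take part in regex filtering at all.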
Thanks for the clarification. I agree it’s better to align with xarray behavior and get consensus before finalizing anything. I’m happy to iterate based on feedback and adjust the implementation/tests as needed. Please let me know what target behavior you’d prefer and I can update the PR accordingly.
Check var_names behaviour and define what should be its type hint #83
xarray supports any hashable type as a variable or dimension name, including tuples such as ("tuple", "name"). This PR updates _subset_list to handle tuple names correctly, avoiding the current behavior where tuple inputs may be interpreted as multiple names.
Tuple inputs are treated as a single item when they exist in whole_list, otherwise they are treated as a container of names. Additionally, string-based filtering (filter_items="like" / "regex") is now applied only to string patterns and items to prevent type errors when non-string hashables are present. The membership validation was also updated to avoid NumPy failures with mixed hashable types.
Tests have been added to cover tuple variable name selection and filtering behavior.
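As a rough illustration of the tuple handling described above, the following sketch normalises `subset` into a list of names and checks membership in pure Python. The function names `normalize_subset` and `find_missing` are hypothetical and this is not the code in the PR; it only mirrors the rules stated in the description, under the assumption that `whole_list` is a plain Python sequence.

```python
from collections.abc import Sequence


def normalize_subset(subset, whole_list):
    """Turn `subset` into a list of names, following the rules in the description."""
    if subset is None:
        return None
    if isinstance(subset, str):
        return [subset]
    # A tuple that is itself a known name is a single item.
    if isinstance(subset, tuple) and subset in whole_list:
        return [subset]
    # Any other non-string sequence is a container of names.
    if isinstance(subset, Sequence) and not isinstance(subset, (str, bytes)):
        return list(subset)
    # Fallback: a lone hashable is treated as a single name.
    return [subset]


def find_missing(subset, whole_list):
    """Pure-Python membership check, so mixed hashable types never go through NumPy."""
    return [item for item in subset if item not in whole_list]


names = ["mu", "tau", ("theta", "original")]
print(normalize_subset(("theta", "original"), names))  # [('theta', 'original')]
print(normalize_subset(("mu", "tau"), names))          # ['mu', 'tau']
print(find_missing(["mu", ("theta", "raw")], names))   # [('theta', 'raw')]
```

Note that restricting the single-item check to `tuple` is exactly the point raised in review: other hashables such as `frozenset` are valid xarray names too, and in this sketch they only reach the fallback branch.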