Conversation
143a0a1 to
2e797b1
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #105 +/- ##
=======================================
Coverage ? 96.78%
=======================================
Files ? 13
Lines ? 872
Branches ? 0
=======================================
Hits ? 844
Misses ? 28
Partials ? 0
🚀 New features to boost your workflow:
|
aad3562 to
2cc418f
Compare
gtca
left a comment
There was a problem hiding this comment.
Thanks for the fixes!
I am also wondering if this fixes specific use cases that were failing before, can we add them to the our tests?
| assert v is None, f"Cannot use {k} with columns." | ||
| if v is not None: | ||
| warnings.warn( | ||
| f"Both columns and {k} given. Columns take precedence, {k} will be ignored", |
There was a problem hiding this comment.
Would something like this improve readability? (I am not sure we have a consistent policy for formatting in such cases.)
Both `columns=...` and `{k}=True` were given. <...>
There was a problem hiding this comment.
If yes, this is also true for similar warnings in other parts of the PR.
There was a problem hiding this comment.
I think that would be a bit misleading here, since the warning will also be emitted if k=False or any other value which is not the None default. Perhaps something like
Both `columns=...` and `{k}={locals()[k]}` were given...
? But I'm not sure if that brings the message across that it should just be not passed at all (as in leave the None default).
2cc418f to
ca406d7
Compare
I did add a test for the one bugfix that this contains (the ordering of pushed columns). The rest of this PR is more quality of life / logic improvements. I can add tests for those, but I'd prefer to do that as part of a larger effort to improve test coverage. |
ilan-gold
left a comment
There was a problem hiding this comment.
Generally just found it a bit hard to reason about what things are done to the columns, so maybe a class would help?
| common | ||
| If True, pull common columns. | ||
| Common columns do not have modality prefixes. | ||
| Pull from all modalities. |
There was a problem hiding this comment.
So if you have a column that is present in all modalities (AnnData objects) and is named, say batch, it will be pulled and its name in the global dataframe will also be batch, so without any modalitiy prefix.
| only_drop | ||
| If True, drop the columns but do not actually pull them. | ||
| Forces drop=True. False by default. |
There was a problem hiding this comment.
So this argument turns this function into a "drop columns" function instead of a "pull columns" function? Made add some context to why one might want to do this (deleting data from your underlying AnnData stores strikes me as a bad idea, but I'm sure there's a reason to do this in some sort of structure manner)
There was a problem hiding this comment.
That was originally implemented by @gtca, so he's best qualified to answer, but I think if you have a bunch of AnnDatas coming from somewhere (e.g. constructed from the output of a proprietary pipeline or some publication), there are often metadata that are either useless (the same value for all observations) or that you just don't care about in your analysis, in which case one can drop them to reduce visual clutter (e.g. when printing the dataframe in a notebook or tab-completing column names).
src/mudata/_core/mudata.py
Outdated
| if only_drop: | ||
| drop = True | ||
|
|
||
| cols = _classify_attr_columns( |
There was a problem hiding this comment.
I think cols should be a class, not just a dictionary, with methods to encapsulate the below iterations (and it's sub-dictionary value as well to handle the modcols logic)
There was a problem hiding this comment.
Generally I would agree with you, but this used at exactly one place in the entire codebase, so I think a class is a bit overkill at this point.
There was a problem hiding this comment.
Although I would consider making it a named tuple for performance reasons.
|
|
||
| if mods is not None: | ||
| cols = [col for col in cols if col["prefix"] in mods] | ||
| cols = { |
There was a problem hiding this comment.
i.e., with cols as a class this could be
prefix_to_cols = cols.filter_by_name_or_derived_name(colums)(I would also advocate changing the name from cols to prefix_to_cols to avoid confusion with columns)
src/mudata/_core/utils.py
Outdated
There was a problem hiding this comment.
The class seems redundant, more class-behavior. Isn't just a check on name?
There was a problem hiding this comment.
In general yes, but having this as a separate attribute makes the filtering in _push_attr() much easier.
There was a problem hiding this comment.
It's a big refactor but this gets back to my point about using classes. It just feels like _classify_attr_columns and _classify_prefixed_columns do such similar things and return such weak types that refactoring around using classes would be a good idea.
It's tough to say without doing it myself, but offhand, I don't see why this should return a list and not the same thing as _classify_attr_columns with methods for getting the needed information that might distinguish them. For example, moving prefix here into the key of the dictionary returned in classify_attr_columns seems. Even there, in that return type the name and derived_name are just slightly different than name and prefix here it seems.
I just generally don't like untyped dictionaries and magic strings and especially in this case, it seems like there's a decent amount of overlapping functionality. But if it's too much for this PR, that's fine. I think stronger typing on the return types is a good start.
There was a problem hiding this comment.
OK, there is now a custom class encapsulating one column and its logic. But moving all the filtering logic etc. into another class feels like overengineering to me tbh. If anything, I would get rid of _classify_attr_columns and just move the few lines into _pull_attr.
There was a problem hiding this comment.
It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:
or
I find these a little difficult to reason about, especially in the absence of comment strings.
There was a problem hiding this comment.
I added some more comments throughout.
1072cf6 to
0dc33db
Compare
- don't raise exception if mods is used together with common or prefixed there is nothing in the logic preventing it - don't raise if columns is used together with common or prefixed, warn instead - minor code cleanup
- don't raise exception if mods is used together with common, nonunique, or unique: there is nothing in the logic preventing it - don't raise if columns is used together with common, nonunique, or unique, warn instead - fix ordering of pushed column - minor code cleanup
0dc33db to
2b26ebf
Compare
ilan-gold
left a comment
There was a problem hiding this comment.
Definitely like the MetadataColumn class!
src/mudata/_core/utils.py
Outdated
There was a problem hiding this comment.
It's up to you. It's not so much about code correctness as it is about readability for the next person. For example:
or
I find these a little difficult to reason about, especially in the absence of comment strings.
_pull_attr/_push_attr
4e2177d to
21b2e83
Compare
ilan-gold
left a comment
There was a problem hiding this comment.
Are there explicit tests for the fixes? For example:
don't raise exception if mods is used together with common or prefixed (push) / common, nonunique, or unique (pull): There is nothing in the logic preventing that
deosn't seem to have a test. I searched through the file and mods is None by default and the other two instances of it in test_push_pull are only with columns=...
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
There is a test for |
| == mdata.mod[m].obs["common_obs_col"].to_numpy()[modmap[mask] - 1] | ||
| ).all() |
There was a problem hiding this comment.
From what I understand, the reason is that 0 encodes missing (why not NA?) so the obsmap indexing starts at 1 and then you need to shift by 1. Is this right?
Unrelated to the PR, why are these not nullables?
There was a problem hiding this comment.
Yes, exactly. When we first implemented this, AnnData's IO did not support nullables yet, so we went with this.
or unique (pull): There is nothing in the logic preventing that