-
Notifications
You must be signed in to change notification settings - Fork 31
Description
In the context of working on #957, I've noticed that something is off with the handling of non-contiguous dataframes. The bottom line is, that we don't handle non-contiguous dataframes correctly in glum, but this does not cause actual issues, due to a number of other bugs. Below are the details:
- By the time we get here, dataframes are converted into tabmat matrices, which this function treats as contiguous, so we don't copy.
- The relevant test does not catch this, because pandas stores its data in F-contiguous arrays, so this kind of indexing does not make it non-contiguous. We should try pd.DataFrame(X).iloc[2:, :] instead, which indeed makes the underlying array non-contiguous, but we still don't fail becasue...
- We accidentally (?) copy the data here for dense matrices. I.e., even if we don't make a copy in glum, we do it in tabmat. Even when we would not have to. This is a tabmat issue, but what we do with glum depends on how we handle it in tabmat.
Fixing the test is trivial. Fixing is_contagious for dataframes is also doable, although doing it in a dataframe-agnostic and performant way requires some care (maybe special-case pandas polars?). Fixing the accidental copy in tabmat is a problem, though. Although the fix is simple, we would break all current and older versions of glum when working with non-contiguous pandas dataframes, so we would have to do a repodata patch and add coordinate the releases.
How should we go forward? I see a number of options.
- Leave things as they are (tabmat always copies dense matrices) as users seem to be fine with that and it works. Remove contiguous dataframe (not matrix!) handling from glum. Remove the
copy_Xparameter from glum as it does not do what it says on the tin. - Fix tabmat, do the repodata patch for glum, handle non-contiguous matrices properly in glum.
- Let tabmat handle copying the data if (and only if!) it needs to (different dtype or non-contiguous array). Remove handling non-contiguity from glum (and probably also deprecate the
copy_Xargument).