Skip to content

Incorrect handling of dataframes with non-contiguous data #959

@stanmart

Description

@stanmart

In the context of working on #957, I've noticed that something is off with the handling of non-contiguous dataframes. The bottom line is, that we don't handle non-contiguous dataframes correctly in glum, but this does not cause actual issues, due to a number of other bugs. Below are the details:

  • By the time we get here, dataframes are converted into tabmat matrices, which this function treats as contiguous, so we don't copy.
  • The relevant test does not catch this, because pandas stores its data in F-contiguous arrays, so this kind of indexing does not make it non-contiguous. We should try pd.DataFrame(X).iloc[2:, :] instead, which indeed makes the underlying array non-contiguous, but we still don't fail becasue...
  • We accidentally (?) copy the data here for dense matrices. I.e., even if we don't make a copy in glum, we do it in tabmat. Even when we would not have to. This is a tabmat issue, but what we do with glum depends on how we handle it in tabmat.

Fixing the test is trivial. Fixing is_contagious for dataframes is also doable, although doing it in a dataframe-agnostic and performant way requires some care (maybe special-case pandas polars?). Fixing the accidental copy in tabmat is a problem, though. Although the fix is simple, we would break all current and older versions of glum when working with non-contiguous pandas dataframes, so we would have to do a repodata patch and add coordinate the releases.

How should we go forward? I see a number of options.

  1. Leave things as they are (tabmat always copies dense matrices) as users seem to be fine with that and it works. Remove contiguous dataframe (not matrix!) handling from glum. Remove the copy_X parameter from glum as it does not do what it says on the tin.
  2. Fix tabmat, do the repodata patch for glum, handle non-contiguous matrices properly in glum.
  3. Let tabmat handle copying the data if (and only if!) it needs to (different dtype or non-contiguous array). Remove handling non-contiguity from glum (and probably also deprecate the copy_X argument).

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions