feat!: Convert R factors to Polars Enum and keep all levels when converting Polars Enum to R factors#1723
etiennebacher wants to merge 9 commits into main
eitsupi
left a comment
Nice!
I was thinking something similar, but note that Python Polars recently exposed a new argument to Categorical, and I was thinking about bringing that into r-polars, but wanted to move forward with S7-ization first.
I agree that the conversion to Enum is useful, but since Enum is marked as experimental in r-polars, I think the factor -> Enum conversion should be controlled through an options argument.
Ah interesting, I missed this addition. I'll see if I can bring this into

Actually, I have only used factor to make ggplot2 pretty, so I'm not sure whether it behaves more like Polars' Categorical or Enum.
Honestly I'm getting a bit lost between enums and categoricals. I will use the current branch to try to implement some
We can always cast to another Enum with more categories.

library(polars)
dat <- pl$DataFrame(
x = factor(c("a", "b"), levels = c("a", "b", "c")),
y = factor(c("d", "e"), levels = c("d", "e"))
)
dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘
dat$cast(x = pl$Enum(c("a", "b", "c", "d", "e")))$select(pl$col("x")$cat$get_categories())
#> shape: (5, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> │ d   │
#> │ e   │
#> └─────┘
I think this branch can basically be merged, but it requires an additional argument to
R/as_polars_series.R
Outdated
opt <- getOption("polars.factor_as_enum")
factor_as_enum <- opt %||% factor_as_enum
What is this intended to do?
The idea is to allow one to set options(polars.factor_as_enum = TRUE) but I think the logic is wrong. We should use the global option if factor_as_enum is missing in the call.
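The precedence described above can be sketched in base R. This is a minimal illustration only: `resolve_factor_as_enum()` is a hypothetical helper, not part of r-polars, and it assumes the option should apply only when the caller did not pass `factor_as_enum`.

```r
# Hypothetical helper for illustration; not the r-polars implementation.
resolve_factor_as_enum <- function(factor_as_enum = FALSE) {
  if (missing(factor_as_enum)) {
    # Consult the global option only when the argument was not supplied.
    opt <- getOption("polars.factor_as_enum")
    if (!is.null(opt)) {
      return(isTRUE(opt))
    }
  }
  isTRUE(factor_as_enum)
}

old <- options(polars.factor_as_enum = TRUE)
from_option <- resolve_factor_as_enum()    # argument missing: the option applies
explicit <- resolve_factor_as_enum(FALSE)  # explicit argument wins over the option
options(old)                               # restore the previous option state
after_reset <- resolve_factor_as_enum()    # no option set: falls back to the default
```

By contrast, `opt %||% factor_as_enum` lets a non-NULL option win even over an explicitly supplied argument.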
Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.
factor_as_enum <- use_option_if_missing(
factor_as_enum,
missing(factor_as_enum),
FALSE,
"polars."
)

options(polars.factor_as_enum = TRUE)
data.frame(x = factor(1:2), y = factor(1:2)) |>
  as_polars_df()

`factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
`factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
shape: (2, 2)
┌──────┬──────┐
│ x    ┆ y    │
│ ---  ┆ ---  │
│ enum ┆ enum │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ 2    │
└──────┴──────┘
> Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.
I think you need to handle the argument here to avoid that.
Lines 178 to 209 in 28fbc06
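One way to read this suggestion, sketched in base R: resolve the option once in the data-frame-level function and pass the result explicitly to each column conversion, so the message is emitted once per call. `convert_column()` and `convert_df()` are hypothetical stand-ins, not the actual r-polars functions, and the message text is shortened.

```r
# Hypothetical per-column converter: never consults the option itself.
convert_column <- function(x, factor_as_enum) {
  if (is.factor(x)) {
    if (factor_as_enum) "enum" else "cat"
  } else {
    typeof(x)
  }
}

# Hypothetical data-frame converter: resolves the option a single time.
convert_df <- function(df, factor_as_enum = FALSE) {
  if (missing(factor_as_enum)) {
    opt <- getOption("polars.factor_as_enum")
    if (!is.null(opt)) {
      # Emitted once per call, however many factor columns there are.
      message("`factor_as_enum` is overridden by the option \"polars.factor_as_enum\"")
      factor_as_enum <- isTRUE(opt)
    }
  }
  vapply(df, convert_column, character(1), factor_as_enum = factor_as_enum)
}

# Count the messages raised while converting two factor columns.
n_messages <- 0L
old <- options(polars.factor_as_enum = TRUE)
types <- withCallingHandlers(
  convert_df(data.frame(x = factor(1:2), y = factor(1:2))),
  message = function(m) {
    n_messages <<- n_messages + 1L
    invokeRestart("muffleMessage")
  }
)
options(old)
```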
Do you mean I should add factor_as_enum after ... in the function definition, or that I should just call getOption("polars.factor_as_enum") in as_polars_df.list?
Thinking about it again, I feel that overriding with options is dangerous as it could potentially destroy behavior everywhere.
Could you keep it so that it can only be enabled by explicit specification?
I think we should improve the R -> Polars and Polars -> R conversions for factors.
Categorical variables are weird in Polars
With the Categorical refactor from a few months ago in upstream Polars, there's a single pool for all categories, which leads to this kind of situation where levels of all columns are mixed together:
I think R factors are much closer to Enums than to Categoricals (honestly I'm not even sure that Polars Categoricals are useful anymore):
-> R factors should be converted to Polars enums
Converting a Categorical to R loses the levels information
This happens a lot in tidypolars tests: I start from iris (in DataFrame format), filter out all rows that match a certain Categorical value, and convert to data.frame. But then this loses the information about the level that was filtered out, while the same operation in tidyverse keeps the levels info.

In this example, we don't even need to filter since not all levels are represented in the data:
It would be nice to keep all the levels here.
-> Polars enums should keep all the levels information when converted to R factors
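For reference, the R-side behaviour being matched can be shown in base R alone (no polars involved). This is a small sketch: filtering rows leaves unused factor levels in place unless they are dropped explicitly, and a factor can be built with a fixed level set even when some levels never occur. The `categories` and `values` vectors are made-up illustration data.

```r
# Filtering rows does not drop unused factor levels in base R.
filtered <- iris[iris$Species != "setosa", ]
kept_levels <- levels(filtered$Species)

# Levels disappear only when dropped explicitly.
dropped_levels <- levels(droplevels(filtered$Species))

# Building a factor from a fixed category set keeps every category as a
# level, which is the shape an Enum -> factor conversion would aim for.
categories <- c("a", "b", "c")
values <- c("a", "b")
f <- factor(values, levels = categories)
```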
@eitsupi This PR is incomplete: I haven't updated the tests yet because I wanted to see if you're OK with it first (and this would be a breaking change). There are about 60 failures, but it seems they would all be solved by an extra $cast() or a snapshot update.

Almost all changes have been done by Claude, but it looks okay to me.
Same examples with this PR:
Related to #1597 and #1146