feat!: Convert R factors to Polars Enum and keep all levels when converting Polars Enum to R factors#1723

Open: etiennebacher wants to merge 9 commits into main from use-enum-for-factors

@etiennebacher etiennebacher commented Jan 27, 2026

I think we should improve the R -> Polars and Polars -> R conversions for factors.

Categorical variables are weird in Polars

With the Categorical refactor from a few months ago in upstream Polars, there's a single pool for all categories, which leads to this kind of situation where levels of all columns are mixed together:

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat$select(pl$col("x")$cat$get_categories())
#> shape: (4, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ d   │
#> │ e   │
#> └─────┘

I think R factors are much closer to Enums than to Categoricals (honestly I'm not even sure that Polars Categoricals are useful anymore):

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)$cast(x = pl$Enum(c("a", "b", "c")), y = pl$Enum(c("d", "e")))

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

-> R factors should be converted to Polars enums

Converting a Categorical to R loses the levels information

This happens a lot in tidypolars tests: I start from iris (in DataFrame format), filter out all rows that match a certain Categorical value, and convert to data.frame. This drops the level that was filtered out, whereas the same operation in the tidyverse keeps it.

In this example, we don't even need to filter since not all levels are represented in the data:

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat |>
  as.data.frame() |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b"

data.frame(x = factor(c("a", "b"), levels = c("a", "b", "c"))) |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b" "c"

It would be nice to keep all the levels here.

-> Polars enums should keep all the levels information when converted to R factors
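
For reference, the base R behavior this section wants to match can be sketched with `iris` alone (no polars involved, and the variable names are only illustrative): filtering rows leaves the factor levels untouched, and dropping unused levels is a separate, explicit step via `droplevels()`.

```r
# Base R only: filtering rows does not drop unused factor levels.
no_setosa <- iris[iris$Species != "setosa", ]

# All three levels survive even though "setosa" no longer appears in the data.
kept_levels <- levels(no_setosa$Species)
print(kept_levels)

# Dropping unused levels is an explicit, opt-in step in R.
dropped_levels <- levels(droplevels(no_setosa)$Species)
print(dropped_levels)
```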


@eitsupi This PR is incomplete: I haven't updated the tests yet because I wanted to see if you're ok with it first (and this would be a breaking change). There are about 60 failures, but it seems they would all be solved by an extra $cast() or a snapshot update.

Almost all changes have been done by Claude, but it looks okay to me.

Same examples with this PR:

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat
#> shape: (2, 2)
#> ┌──────┬──────┐
#> │ x    ┆ y    │
#> │ ---  ┆ ---  │
#> │ enum ┆ enum │  # ----> now "enum", not "cat" anymore
#> ╞══════╪══════╡
#> │ a    ┆ d    │
#> │ b    ┆ e    │
#> └──────┴──────┘

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

dat |>
  as.data.frame() |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b" "c"

Related to #1597 and #1146

@eitsupi eitsupi left a comment

Nice!

I was thinking something similar, but note that Python Polars recently exposed a new argument to Categorical, and I was thinking about bringing that into r-polars, but wanted to move forward with S7-ization first.

I agree that the conversion to Enum is useful, but since Enum is marked as experimental in r-polars, I think the factor -> Enum conversion should be controlled by an opt-in argument.

@etiennebacher (Collaborator, Author):

> I was thinking something similar, but note that Python Polars recently exposed a new argument to Categorical

Ah interesting, I missed this addition. I'll see if I can bring this into r-polars in another PR and then we can determine if this one is still necessary.

eitsupi commented Jan 28, 2026

Actually, I have only used factor to make ggplot2 pretty, so I'm not sure whether it behaves more like Polars' Categorical or Enum.
The disadvantage of Enum is that we can't add other categories later.
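
On the R side, a factor's level set is also fixed once it is created, which can be checked in base R without polars. The sketch below only illustrates R's behavior and makes no claim about how either Polars type behaves: assigning a value outside the declared levels produces `NA` (with a warning), and new levels have to be added explicitly.

```r
f <- factor(c("a", "b"), levels = c("a", "b", "c"))

# Assigning a value that is not a declared level yields NA (with a warning,
# silenced here so the example runs quietly).
g <- f
suppressWarnings(g[2] <- "z")
print(is.na(g[2]))

# Extending the level set is an explicit step; existing values are unchanged.
h <- factor(f, levels = c(levels(f), "d", "e"))
print(levels(h))
print(as.character(h))
```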

etiennebacher commented Jan 28, 2026

Honestly I'm getting a bit lost between enums and categoricals. I will use the current branch to try to implement some forcats functions in tidypolars so I have a better idea of whether enums cover all the use cases.

> The disadvantage of Enum is that we can't add other categories later.

We can always cast to another Enum with more categories.

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

dat$cast(x = pl$Enum(c("a", "b", "c", "d", "e")))$select(pl$col("x")$cat$get_categories())
#> shape: (5, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> │ d   │
#> │ e   │
#> └─────┘

eitsupi commented Jan 28, 2026

I think this branch can basically be merged, but it requires an additional argument to as_polars_series.factor to opt into the behavior.

@etiennebacher etiennebacher marked this pull request as ready for review January 28, 2026 19:26
Comment on lines 262 to 263
opt <- getOption("polars.factor_as_enum")
factor_as_enum <- opt %||% factor_as_enum

@eitsupi (Collaborator):

What is this intended to do?

@etiennebacher (Collaborator, Author):

The idea is to let users set options(polars.factor_as_enum = TRUE), but I think the logic is wrong: we should use the global option only if factor_as_enum is missing in the call.
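
As a minimal sketch of that fix in base R (the function `convert_factor()` is a hypothetical stand-in, not the r-polars code), the option is consulted only when the caller did not supply `factor_as_enum`. With `opt %||% factor_as_enum`, by contrast, a set option would always win over an explicit argument.

```r
# Hypothetical stand-in for the conversion; it returns the target type name only.
convert_factor <- function(x, factor_as_enum = FALSE) {
  if (missing(factor_as_enum)) {
    # Fall back to the global option only when the argument was not supplied.
    factor_as_enum <- getOption("polars.factor_as_enum", default = FALSE)
  }
  if (isTRUE(factor_as_enum)) "enum" else "cat"
}

f <- factor(c("a", "b"))

old <- options(polars.factor_as_enum = TRUE)
from_option <- convert_factor(f)                            # option is used
explicit_wins <- convert_factor(f, factor_as_enum = FALSE)  # argument is used
options(old)                                                # restore previous state

without_option <- convert_factor(f)                         # default is used
```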

@etiennebacher (Collaborator, Author):

Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.

  factor_as_enum <- use_option_if_missing(
    factor_as_enum,
    missing(factor_as_enum),
    FALSE,
    "polars."
  )

options(polars.factor_as_enum = TRUE)

data.frame(x = factor(1:2), y = factor(1:2)) |>
  as_polars_df()
#> `factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
#> `factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
#> shape: (2, 2)
#> ┌──────┬──────┐
#> │ x    ┆ y    │
#> │ ---  ┆ ---  │
#> │ enum ┆ enum │
#> ╞══════╪══════╡
#> │ 1    ┆ 1    │
#> │ 2    ┆ 2    │
#> └──────┴──────┘

@eitsupi (Collaborator):

> Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.

I think you need to handle the argument here to avoid that.

r-polars/R/as_polars_df.R

Lines 178 to 209 in 28fbc06

as_polars_df.list <- function(x, ...) {
  .args <- list2(...)
  # Should not pass the `name` argument
  .args$name <- NULL
  list_of_series <- lapply(x, \(column) eval(call2("as_polars_series", column, !!!.args)))

  # Series with length 1 should be recycled
  unique_lengths <- unique(lengths(list_of_series))
  n_lengths <- length(unique_lengths)

  list_of_plr_series <- if (n_lengths <= 1L) {
    list_of_series |>
      lapply(\(series) series$`_s`)
  } else {
    n_rows <- max(unique_lengths[unique_lengths != 1L])
    list_of_series |>
      lapply(
        \(series) {
          if (length(series) == 1L) {
            # Recycle the series with length 1
            pl$select(pl$repeat_(series, n_rows))$to_series()$`_s`
          } else {
            series$`_s`
          }
        }
      )
  }

  list_of_plr_series |>
    PlRDataFrame$init() |>
    wrap()
}
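
One way to read this suggestion, sketched in base R with hypothetical helpers (`convert_column()` and `convert_list()` are illustrative stand-ins, not r-polars functions, and the message text is made up): resolve the option once at the list level, emit the message once, and pass the resolved value explicitly to each column conversion.

```r
# Hypothetical per-column converter: it never looks at the option itself.
convert_column <- function(x, factor_as_enum = FALSE) {
  if (!is.factor(x)) {
    return(typeof(x))
  }
  if (isTRUE(factor_as_enum)) "enum" else "cat"
}

# Hypothetical list-level converter: resolves the option a single time.
convert_list <- function(x, ..., factor_as_enum) {
  if (missing(factor_as_enum)) {
    factor_as_enum <- getOption("polars.factor_as_enum", default = FALSE)
    if (isTRUE(factor_as_enum)) {
      message("`factor_as_enum` is taken from the option \"polars.factor_as_enum\"")
    }
  }
  # The resolved value is passed down explicitly, so columns stay silent.
  vapply(x, convert_column, character(1), factor_as_enum = factor_as_enum)
}

# Count how many messages are emitted for a data.frame with two factor columns.
n_messages <- 0L
old <- options(polars.factor_as_enum = TRUE)
types <- withCallingHandlers(
  convert_list(data.frame(x = factor(1:2), y = factor(1:2))),
  message = function(m) {
    n_messages <<- n_messages + 1L
    invokeRestart("muffleMessage")
  }
)
options(old)
```

In this sketch the message count stays at one regardless of how many factor columns the input has, because only the outer function consults the option.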

@etiennebacher (Collaborator, Author):

Do you mean I should add factor_as_enum after ... in the function definition, or that I should just call getOption("polars.factor_as_enum") in as_polars_df.list?

@eitsupi (Collaborator):

Thinking about it again, I feel that overriding with options is dangerous, as it could silently change behavior everywhere.
Could you keep it so that it can only be enabled by explicitly passing the argument?
