feat!: Convert R factors to Polars Enum and keep all levels when converting Polars Enum to R factors by etiennebacher · Pull Request #1723 · pola-rs/r-polars

etiennebacher · 2026-01-27T23:08:21Z

I think we should improve the R -> Polars and Polars -> R conversions for factors.

Categorical variables are weird in Polars

With the Categorical refactor from a few months ago in upstream Polars, there's a single pool for all categories, which leads to this kind of situation where levels of all columns are mixed together:

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat$select(pl$col("x")$cat$get_categories())
#> shape: (4, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ d   │
#> │ e   │
#> └─────┘

I think R factors are much closer to Enums than to Categoricals (honestly I'm not even sure that Polars Categoricals are useful anymore):

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)$cast(x = pl$Enum(c("a", "b", "c")), y = pl$Enum(c("d", "e")))

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

-> R factors should be converted to Polars enums

Converting a Categorical to R loses the levels information

This happens a lot in tidypolars tests: I start from iris (in DataFrame format), filter out all rows that match a certain Categorical value, and convert to data.frame. But then this loses the information about the level that was filtered out while the same operation in tidyverse keeps the levels info.

In this example, we don't even need to filter since not all levels are represented in the data:

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat |>
  as.data.frame() |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b"

data.frame(x = factor(c("a", "b"), levels = c("a", "b", "c"))) |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b" "c"

It would be nice to keep all the levels here.

-> Polars enums should keep all the levels information when converted to R factors
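
For reference, the filtering scenario mentioned above (dropping all rows of one `iris` species) can be sketched in base R alone. This only illustrates the R-side behavior that the conversion would ideally preserve; it does not involve polars, and the variable name `kept` is made up for the example.

```r
# Base R only: drop every row of one species, then inspect the factor levels
kept <- iris[iris$Species != "setosa", ]

# The filtered-out level is still part of the factor
levels(kept$Species)
#> [1] "setosa"     "versicolor" "virginica"

# Unused levels are only removed on explicit request
levels(droplevels(kept)$Species)
#> [1] "versicolor" "virginica"
```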

@eitsupi This PR is incomplete: I haven't updated the tests yet because I wanted to check that you're ok with it first (this would be a breaking change). There are about 60 failures, but it seems they would all be solved by an extra $cast() or a snapshot update.

Almost all changes have been done by Claude, but it looks okay to me.

Same examples with this PR:

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat
#> shape: (2, 2)
#> ┌──────┬──────┐
#> │ x    ┆ y    │
#> │ ---  ┆ ---  │
#> │ enum ┆ enum │  # ----> now "enum", not "cat" anymore
#> ╞══════╪══════╡
#> │ a    ┆ d    │
#> │ b    ┆ e    │
#> └──────┴──────┘

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

dat |>
  as.data.frame() |>
  dplyr::pull(x) |>
  levels()
#> [1] "a" "b" "c"

Related to #1597 and #1146

eitsupi

Nice!

I was thinking something similar, but note that Python Polars recently exposed a new argument to Categorical, and I was thinking about bringing that into r-polars, but wanted to move forward with S7-ization first.

I agree that the conversion to Enum is useful, but since Enum is marked as experimental in r-polars, I think the factor -> Enum conversion should be controlled by an optional argument.

etiennebacher · 2026-01-28T09:17:33Z

> I was thinking something similar, but note that Python Polars recently exposed a new argument to Categorical

Ah interesting, I missed this addition. I'll see if I can bring this into r-polars in another PR and then we can determine if this one is still necessary.

eitsupi · 2026-01-28T11:47:39Z

Actually, I have only used factors to make ggplot2 plots pretty, so I'm not sure whether they behave more like Polars' Categorical or Enum.
The disadvantage of Enum is that we can't add another category later.
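
For comparison, here is a base R sketch of how a factor treats a value outside its level set. It only illustrates R's behavior and does not touch Polars; the variable names are made up for the example.

```r
f <- factor(c("a", "b"))

# Assigning a value outside the level set does not add a level:
# R warns ("invalid factor level, NA generated") and stores NA instead.
# The warning is suppressed here only to keep the output quiet.
g <- f
suppressWarnings(g[2] <- "c")
is.na(g[2])
levels(g)

# Extending the level set first makes the same assignment valid
h <- f
levels(h) <- c(levels(h), "c")
h[2] <- "c"
as.character(h)
levels(h)
```

In this respect an R factor has a fixed set of levels that must be extended explicitly before new values can be stored.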

etiennebacher · 2026-01-28T12:23:28Z

Honestly I'm getting a bit lost between enums and categoricals. I will use the current branch to try to implement some forcats functions in tidypolars so I have a better idea of whether enums cover all the use cases.

> The disadvantage of Enum is that we can't add another category later.

We can always cast to another Enum with more categories.

library(polars)

dat <- pl$DataFrame(
  x = factor(c("a", "b"), levels = c("a", "b", "c")),
  y = factor(c("d", "e"), levels = c("d", "e"))
)

dat$select(pl$col("x")$cat$get_categories())
#> shape: (3, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> └─────┘

dat$cast(x = pl$Enum(c("a", "b", "c", "d", "e")))$select(pl$col("x")$cat$get_categories())
#> shape: (5, 1)
#> ┌─────┐
#> │ x   │
#> │ --- │
#> │ str │
#> ╞═════╡
#> │ a   │
#> │ b   │
#> │ c   │
#> │ d   │
#> │ e   │
#> └─────┘

eitsupi · 2026-01-28T12:40:31Z

I think this branch can basically be merged, but it requires an additional argument to as_polars_series.factor to opt into the behavior.

eitsupi · 2026-01-28T23:43:43Z

R/as_polars_series.R

+  opt <- getOption("polars.factor_as_enum")
+  factor_as_enum <- opt %||% factor_as_enum


What is this intended to do?

The idea is to allow one to set options(polars.factor_as_enum = TRUE) but I think the logic is wrong. We should use the global option if factor_as_enum is missing in the call.
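
A minimal base R sketch of the "use the global option only if the argument is missing" logic described above. The helper name `resolve_factor_as_enum()` is hypothetical and is not part of r-polars; it only contrasts with `opt %||% factor_as_enum`, where a set option always wins over the argument.

```r
# Hypothetical helper, not the r-polars implementation
resolve_factor_as_enum <- function(factor_as_enum = FALSE) {
  if (missing(factor_as_enum)) {
    # Argument not supplied: fall back to the option, then to FALSE
    getOption("polars.factor_as_enum", default = FALSE)
  } else {
    # Argument supplied: it takes precedence over the option
    factor_as_enum
  }
}

options(polars.factor_as_enum = TRUE)
resolve_factor_as_enum()       # the option is used
resolve_factor_as_enum(FALSE)  # the explicit argument wins

options(polars.factor_as_enum = NULL)
resolve_factor_as_enum()       # falls back to the default
```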

Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.

factor_as_enum <- use_option_if_missing(
  factor_as_enum,
  missing(factor_as_enum),
  FALSE,
  "polars."
)

options(polars.factor_as_enum = TRUE)
data.frame(x = factor(1:2), y = factor(1:2)) |>
  as_polars_df()

`factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
`factor_as_enum` is overridden by the option "polars.factor_as_enum" with `TRUE`
shape: (2, 2)
┌──────┬──────┐
│ x    ┆ y    │
│ ---  ┆ ---  │
│ enum ┆ enum │
╞══════╪══════╡
│ 1    ┆ 1    │
│ 2    ┆ 2    │
└──────┴──────┘

> Note that if we use use_option_if_missing() here, we get duplicated messages if we convert a data.frame with several factors to a polars dataframe.

I think you need to handle the argument here to avoid that.

r-polars/R/as_polars_df.R

Lines 178 to 209 in 28fbc06

as_polars_df.list <- function(x, ...) {
  .args <- list2(...)
  # Should not pass the `name` argument
  .args$name <- NULL
  list_of_series <- lapply(x, \(column) eval(call2("as_polars_series", column, !!!.args)))

  # Series with length 1 should be recycled
  unique_lengths <- unique(lengths(list_of_series))
  n_lengths <- length(unique_lengths)

  list_of_plr_series <- if (n_lengths <= 1L) {
    list_of_series |>
      lapply(\(series) series$`_s`)
  } else {
    n_rows <- max(unique_lengths[unique_lengths != 1L])
    list_of_series |>
      lapply(
        \(series) {
          if (length(series) == 1L) {
            # Recycle the series with length 1
            pl$select(pl$repeat_(series, n_rows))$to_series()$`_s`
          } else {
            series$`_s`
          }
        }
      )
  }

  list_of_plr_series |>
    PlRDataFrame$init() |>
    wrap()
}

Do you mean I should add factor_as_enum after ... in the function definition, or that I should just call getOption("polars.factor_as_enum") in as_polars_df.list?
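
To make the first alternative concrete, here is a base R sketch in which the list-level function takes `factor_as_enum` after `...`, resolves the option once, and passes the result explicitly to each column conversion, so the message appears once per call. `convert_column()` and `convert_list()` are hypothetical stand-ins for `as_polars_series()` and `as_polars_df.list()`, and the message text is made up; none of this is the actual r-polars code.

```r
# Hypothetical stand-in for as_polars_series(): reports the dtype it would pick
convert_column <- function(column, factor_as_enum = FALSE) {
  if (!is.factor(column)) {
    return(class(column)[1])
  }
  if (factor_as_enum) "enum" else "cat"
}

# Hypothetical stand-in for as_polars_df.list(), with the argument after `...`
convert_list <- function(x, ..., factor_as_enum = FALSE) {
  if (missing(factor_as_enum)) {
    opt <- getOption("polars.factor_as_enum")
    if (!is.null(opt)) {
      # Emitted once per call, however many factor columns `x` has
      message("`factor_as_enum` is taken from the option \"polars.factor_as_enum\"")
      factor_as_enum <- opt
    }
  }
  # Pass the resolved value explicitly so the per-column code never
  # consults the option itself
  vapply(x, convert_column, character(1), factor_as_enum = factor_as_enum)
}

options(polars.factor_as_enum = TRUE)
convert_list(list(x = factor(1:2), y = factor(1:2)))
options(polars.factor_as_enum = NULL)
```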

Thinking about it again, I feel that overriding with options is dangerous, as it could potentially break behavior everywhere.
Could you keep it so that it can only be enabled by passing the argument explicitly?

R/datatypes-classes.R

R/as_polars_series.R

etiennebacher added 2 commits January 27, 2026 23:46

convert R factors to Polars enum

9884fd3

convert Polars enums to R factors [skip ci]

9caeeac

etiennebacher requested review from Copilot and eitsupi January 27, 2026 23:08

Copilot started reviewing on behalf of etiennebacher January 27, 2026 23:08

fix [skip ci]

4d2cb58

eitsupi reviewed Jan 28, 2026

etiennebacher added 3 commits January 28, 2026 19:37

docs and test

fbfe676

news

5d8600f

add test for Polars enum -> R factor

3b04c43

etiennebacher marked this pull request as ready for review January 28, 2026 19:26

etiennebacher added 2 commits January 28, 2026 20:29

bump version

6af980a

fix

0154b78

eitsupi requested changes Jan 28, 2026

R/as_polars_series.R (outdated)

comments

e83f48e

etiennebacher mentioned this pull request Jan 29, 2026

feat: Start supporting forcats functions etiennebacher/tidypolars#319

Draft
