docs: improve fallbacks vignette #590
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -45,13 +45,10 @@ conflict_prefer("filter", "dplyr") | |
| ## Introduction | ||
|
|
||
| The duckplyr package aims at providing a fully compatible drop-in replacement for dplyr. | ||
| All operations, R functions, and data types that are supported by dplyr should work in an identical way with duckplyr. | ||
| This is achieved in two ways: | ||
| Currently, only a carefully selected subset of dplyr's operations, R functions, and R data types are implemented (see `vignette("limits")`). | ||
| Whenever a request cannot be handled by DuckDB, duckplyr falls back to dplyr. | ||
|
|
||
| - A carefully selected subset of dplyr operations, R functions, and R data types are implemented in DuckDB, focusing on faithful translation. | ||
| - When DuckDB does not support an operation, duckplyr falls back to dplyr, guaranteeing identical behavior. | ||
|
|
||
| ## DuckDB mode | ||
| ## A pipeline directly supported by duckplyr | ||
|
|
||
| The following operation is supported by duckplyr: | ||
|
|
||
|
|
@@ -70,18 +67,18 @@ duckdb |> | |
| explain() | ||
| ``` | ||
|
|
||
| The plan shows three operations: | ||
| The plan shows three **operations**: | ||
|
At most one per article, please, and really only if necessary 📣
Hehe, I'll keep the bold for the R-hub blog, for instance, then. I find it easier for skimming (reading once, then skimming to remind myself of important points) compared to italic. 😇
||
|
|
||
| - a data frame scan (the input), | ||
| - a data frame scan (the input), | ||
| - a sort operation, | ||
| - a projection (adding the `b` column and removing the `a` column). | ||
|
|
||
| Each operation is supported by DuckDB. | ||
| The resulting object contains a plan for the entire pipeline that is executed lazily, only when the data is needed. | ||
| Because each operation is supported by DuckDB, the resulting object contains a **plan for the entire pipeline**. | ||
| The plan is only executed when the data is needed, i.e. lazily (see `vignette("prudence")`). | ||
|
|
||
| ## Relation objects | ||
| ### Relation objects | ||
|
|
||
| DuckDB accepts a tree of interconnected _relation objects_ as input. | ||
| DuckDB accepts a tree of interconnected *relation objects* as input. | ||
| Each relation object represents a logical step of the execution plan. | ||
| The duckplyr package translates dplyr verbs into relation objects. | ||
|
|
||
|
|
@@ -101,7 +98,7 @@ duckplyr::last_rel() | |
|
|
||
| The `last_rel()` function now shows a relation that describes the logical plan for executing the whole pipeline. | |
|
|
||
| ## Help from dplyr | ||
| ## A pipeline with functionality not directly supported by duckplyr | ||
|
I prefer shorter titles.
It's now "Help from dplyr".
How about "Outsourcing to dplyr"?
||
|
|
||
| Using a custom function with a side effect is not supported by DuckDB and triggers a dplyr fallback: | ||
|
|
||
|
|
@@ -118,7 +115,7 @@ fallback <- | |
| select(-a) | ||
| ``` | ||
|
|
||
| The `verbose_plus_one()` function is not supported by DuckDB, so the `mutate()` step is forwarded to dplyr and already executed (eagerly) when the pipeline is defined. | ||
| The `verbose_plus_one()` function is not supported by DuckDB, so the `mutate()` step is handled by dplyr and already executed when the pipeline is defined, i.e. eagerly. | ||
| This is confirmed by the `last_rel()` function: | ||
|
|
||
| ```{r} | ||
|
|
@@ -148,30 +145,26 @@ duckplyr::last_rel() | |
|
|
||
| The `last_rel()` function confirms that only the final `select()` is handled by DuckDB again. | ||
|
|
||
| ## Enforce DuckDB operation | ||
|
|
||
| For any duck frame, one can control the automatic materialization. | ||
| For fallbacks to dplyr, automatic materialization must be allowed for the duck frame at hand, as dplyr necessitates eager evaluation. | ||
|
|
||
| Therefore, by making a data frame frugal, one can ensure a pipeline will error when a fallback to dplyr would have normally happened. | ||
| See `vignette("prudence")` for details. | ||
|
|
||
| By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way. | ||
|
|
||
| ## Configure fallbacks | ||
|
|
||
| Using the `fallback_sitrep()` and `fallback_config()` functions you can examine and change settings related to fallbacks. | ||
|
|
||
| - You can choose to make fallbacks verbose with `fallback_config(info = TRUE)`. | ||
|
|
||
| - You can change settings related to logging and reporting fallbacks to the duckplyr development team to inform their work. | |
| - You can change settings related to logging and reporting fallbacks to the duckplyr development team to inform their work. See `vignette("telemetry")`. | |
|
|
||
| ### Enforcing DuckDB operation | ||
|
|
||
| For any duck frame, one can control the automatic materialization. | ||
| For fallbacks to dplyr, automatic materialization must be allowed for the frame at hand, as dplyr necessitates eager evaluation. | |
|
|
||
| Therefore, by making a data frame frugal, one can ensure a pipeline will error when a fallback to dplyr would have normally happened. See `vignette("prudence")`. | ||
|
|
||
| See `vignette("telemetry")` for details. | ||
| By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way. | ||
|
|
||
| ## Conclusion | ||
|
|
||
| The fallback mechanism in duckplyr allows for a seamless integration of dplyr verbs and R functions that are not supported by DuckDB. | ||
| It is transparent to the user and only triggers when necessary. | ||
| With small or medium-sized data sets, it will not even be noticeable in most settings. | ||
|
|
||
| See `vignette("large")` for techniques for working with large data, `vignette("limits")` for the currently implemented translations, `vignette("prudence")` for details on controlling fallback behavior, and `vignette("telemetry")` for the automatic reporting of fallback situations.
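
The fallback settings described under "Configure fallbacks" in the diff can be tried out with a short sketch like the one below. It uses only the two functions and the `info = TRUE` argument named in the vignette text. It assumes duckplyr is installed, and the exact output depends on the installed version and on local settings.

```r
# A minimal sketch, assuming the duckplyr package is installed.
library(duckplyr)

# Report the current fallback-related settings for this session.
fallback_sitrep()

# Make fallbacks verbose, as described in the vignette text:
# duckplyr then informs the user when a step is handed over to dplyr.
fallback_config(info = TRUE)
```

Running `fallback_sitrep()` again after the change is a quick way to check that the setting has taken effect.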
it is confusing to say you achieve a fully compatible drop-in by carefully selecting a subset of features to implement.
Is that still the case with the current wording?
It could still be rephrased then. But I used "currently" to mean that later this could change.