Søren Havelund Welling edited this page Nov 20, 2022 · 12 revisions

Background

polars is a new and very fast data manipulation library written in Rust on top of Apache Arrow storage. For larger data pipelines in particular, polars brings to R:

  • Lazy file scanners (parquet, csv, ipc, ...)
  • Lazy interaction with SQL databases
  • Query optimization across mixed data sources
  • Larger than memory data manipulation
  • Seamless multi-threading
  • Easy and powerful scalability to hundreds of CPUs without cluster computing or much configuration
  • A type-rich environment
  • Immutable (copy-on-write) data structures that are very true to the spirit of R's functional paradigm

Related work (opinionated, feel free to disagree)

  • data.table package: C instead of Rust. Not arrow storage. No query optimization. No lazy syntax. Still pretty awesome.
  • arrow package: arrow storage + dplyr. No optimization, no extensive multithreading. A very popular syntax.
  • sparkR: polars copied the syntax. Great for Big Data. Cumbersome to set up, especially in a CI/CD machine-learning environment. Not very efficient (computation/resources). Long boot-up times. Only reasonably fast when using large clusters for long periods.

Details of project: Bring awesome polars to R now!

Polars has 500+ 'functions/methods' to implement, and new ones are added every month. The most needed contributions are: binding features from the Rust API, then writing the R function + docs + tests. Many tasks are not that difficult, but there is a lot of work to be done! There are also R-only tasks: improving the R syntax with best practices and writing vignettes and tutorials. If you are a Rust and C whiz, there are still interesting performance-improvement tasks to be sorted out.

40%+ of features translated (as of November 2022) in:

Early proof of concept, including the R altrep vector:

Early requests on the extendr and polars issue trackers:

Also important:

  • extendr: invaluable groundwork to fuse R and Rust. Template.
  • py-polars: How polars was implemented in Python.
  • nodejs-polars: How polars was implemented in Node.js.
  • The book for starting Rust: It's an amazing journey. Afterwards, you will see programming differently.

Expected impact

If R is to stay relevant as a production language, then polars is a great stepping stone. For tabular data where computation resources are a limiting factor, polars should be considered.

Mentors

Contributors, please contact mentors below after completing at least one of the tests below.

  • Soren H. Welling [email protected] author of minipolars. New to R-GSOC. Independent consultant tackling data science problems with R, C++ and Python. On a deep dive into Rust since last year. PhD in some ML + computational chemistry.

  • Toby Hocking [email protected] has 10+ years experience in R-GSOC, and can co-mentor.

Tests

Contributors, please do one or more of the following tests before contacting the mentors above.

MENTORS: write several tests that potential contributors can do to demonstrate their capabilities for this particular project. Ask some hard questions that will give you insight about how the contributors write code to solve problems. You'll see that the harder the questions that you ask, the easier it will be for you to choose between the contributors that apply for your project! Please modify the suggestions below to make them specific for your project.

  • Easy: Install minipolars directly from the precompiled binary, see the GitHub Readme.md. Write a lazy query which lazily reads two csv files, joins them, filters, and manipulates columns via expressions. Build the query with at least 15 of the already translated expression functions. Use apply and/or map to execute an R user function within a lazy polars query.

  • Medium: Use rustup to install Rust nightly. Clone minipolars. Restore the renv environment. Build/compile the minipolars package locally. Implement a function Expr_sum_add2 which should behave like Expr_sum but also add 2 to the result. The add-2 implementation should be on the Rust side (see Expr::add and the eager/lazy cookbooks in the polars Rust API docs). Add documentation and show that the function can be used in the package.

  • Hard: Make a pull request implementing scan_ipc (py-polars example) similar to scan_csv (minipolars example). Far from all types are auto-translated by extendr, so you will likely have to write/use some wrapper types.
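
To give a feel for what "sum, then add 2 on the Rust side" means in a lazy setting, here is a minimal, self-contained sketch. It is a toy model only: the `Expr` enum, its `eval` method and the `sum_add2` helper are hypothetical stand-ins written for illustration and are not the polars or minipolars API. The point is that an expression is first built as a tree and only computed when it is evaluated against data.

```rust
use std::collections::HashMap;
use std::ops::Add;

/// Toy lazy expression tree (hypothetical; NOT the polars `Expr` type).
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(f64),
    Sum(Box<Expr>),
    Plus(Box<Expr>, Box<Expr>),
}

/// Overload `+` so expressions compose without computing anything yet.
impl Add for Expr {
    type Output = Expr;
    fn add(self, rhs: Expr) -> Expr {
        Expr::Plus(Box::new(self), Box::new(rhs))
    }
}

impl Expr {
    /// Wrap the expression in a (still unevaluated) sum aggregation.
    fn sum(self) -> Expr {
        Expr::Sum(Box::new(self))
    }

    /// Hypothetical analogue of the `Expr_sum_add2` task: sum, then add 2.
    fn sum_add2(self) -> Expr {
        self.sum() + Expr::Literal(2.0)
    }

    /// Evaluate the tree against named columns; `None` if a column is
    /// missing or the operand lengths cannot be combined.
    fn eval(&self, columns: &HashMap<String, Vec<f64>>) -> Option<Vec<f64>> {
        match self {
            Expr::Column(name) => columns.get(name).cloned(),
            Expr::Literal(value) => Some(vec![*value]),
            Expr::Sum(inner) => Some(vec![inner.eval(columns)?.iter().sum()]),
            Expr::Plus(lhs, rhs) => {
                let l = lhs.eval(columns)?;
                let r = rhs.eval(columns)?;
                if l.len() == r.len() {
                    Some(l.iter().zip(&r).map(|(a, b)| a + b).collect())
                } else if r.len() == 1 {
                    // Broadcast a length-1 right operand over the left.
                    Some(l.iter().map(|a| a + r[0]).collect())
                } else if l.len() == 1 {
                    // Broadcast a length-1 left operand over the right.
                    Some(r.iter().map(|b| l[0] + b).collect())
                } else {
                    None
                }
            }
        }
    }
}
```

Building `Expr::Column("x".to_string()).sum_add2()` does no work by itself; the sum and the addition only happen when `eval` is called with a map of columns. In the real task the tree, the aggregation and the addition are provided by polars, and the work is in exposing them to R through extendr.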

Solutions of tests

Contributors, please post a link to your test results here.

  • EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.