-
Notifications
You must be signed in to change notification settings - Fork 5
Description
py-wake
-
Things to figure out
- Drop in performance when having Python with Rust backend.
-
Interface similar to pandas:
-
df1 = edf.read_csv(file_names)- Only creates and doesn't execute.-
To obtain actual result, you need to call another function on the dataframe.
-
WAY 1: Getting Result 1 at a time - Extrinsic Result of this DF.
while(df1.has_result()) result_df = df1.result() -
WAY 2: Getting Result Progressively. Stopping execution when good enough.
df1.run_online(callback_function)- The good part is - stopping execution stops execution of all nodes.
- We are still able to have some benefit of pipelining since multiple steps.
- How to provide a mechanism to stop execution while preserving progress?
- Can have an atomic variable which is checked in the Node execution.
- If the variable is true, then that node's execution is paused.
-
-
df2 = df1.filter(lambda x: x["quantity"] > 5.0)
Now, df2 is an execution graph - a custom class which has (1) execution service, (2) the list of nodes, and (3) their execution output channels (which will remain unconsumed, so that the variable can be later re-used in another query):[read_csv] [appender] => [RESULT-DF2] || // [RESULT-DF1]- When you call result() on this, it starts execution on the read_csv() and appender() node.
- read_csv() when executing writes to RESULT-DF1, appender reads from RESULT-DF1 and writes to RESULT-DF2. Thus, instead of subscribing to a node, need another public interface to subscribe to a channel.
-
df3 = df2.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))[read_csv] [appender] [appender] => [RESULT-DF3] || // || // [RESULT-DF1] [RESULT-DF2]- Now, each variable df1, df2 => Their corresponding nodes have output results saved.
- Thus, by default we can make it to be inplace update. So, that copies are not created.
-
Alternate way - by default inplace.
df1 = edf.read_csv([List of files]) df1.filter(lambda x: x["quantity"] > 5.0) df1.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"]))) [read_csv] => [appender] => [appender] => [RESULT-DF1] -
-
Challenges
- Overhead of stopping execution in-between.
- If supporting re-using variables across different queries, then that will lead to keeping the output reader available. Higher memory usage.
- How to be able to transform python functions or expressions to Rust functions/lambdas?