Skip to content

DeepOLA: Python Interface Thoughts #159

@nikhil96sher

Description

@nikhil96sher

py-wake

  • Things to figure out

    • Drop in performance when having Python with Rust backend.
  • Interface similar to pandas:

    • df1 = edf.read_csv(file_names) - Only creates and doesn't execute.

      • To obtain actual result, you need to call another function on the dataframe.

      • WAY 1: Getting Result 1 at a time - Extrinsic Result of this DF.

         while(df1.has_result())
           result_df = df1.result()
        
      • WAY 2: Getting Result Progressively. Stopping execution when good enough.
        df1.run_online(callback_function)

        • The good part is - stopping execution stops execution of all nodes.
        • We are still able to have some benefit of pipelining since multiple steps.
        • How to provide a mechanism to stop execution while preserving progress?
          • Can have an atomic variable which is checked in the Node execution.
          • If the variable is true, then that node's execution is paused.
    • df2 = df1.filter(lambda x: x["quantity"] > 5.0)
      Now, df2 is an execution graph - a custom class which has (1) execution service, (2) the list of nodes, and (3) their execution output channels (which will remain unconsumed, so that the variable can be later re-used in another query):

                  [read_csv]    [appender] => [RESULT-DF2]
                      ||       //
                  [RESULT-DF1]
      
      • When you call result() on this, it starts execution on the read_csv() and appender() node.
      • read_csv() when executing writes to RESULT-DF1, appender reads from RESULT-DF1 and writes to RESULT-DF2. Thus, instead of subscribing to a node, need another public interface to subscribe to a channel.
    • df3 = df2.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))

          [read_csv]      [appender]   [appender] => [RESULT-DF3]
              ||       //     ||      //
          [RESULT-DF1]    [RESULT-DF2]
      
      • Now, each variable df1, df2 => Their corresponding nodes have output results saved.
      • Thus, by default we can make it to be inplace update. So, that copies are not created.
    • Alternate way - by default inplace.

    df1 = edf.read_csv([List of files])
    df1.filter(lambda x: x["quantity"] > 5.0)
    df1.map(lambda x: col("revenue", (1 - x["discount"]) * x["extendedprice"])))
    
    [read_csv] => [appender] => [appender] => [RESULT-DF1]
    
  • Challenges

    • Overhead of stopping execution in-between.
    • If supporting re-using variables across different queries, then that will lead to keeping the output reader available. Higher memory usage.
    • How to be able to transform python functions or expressions to Rust functions/lambdas?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions