
Commit f0837de

docs: update README and user guide to reflect register_view method for DataFrame registration
Parent: c31395f

File tree: 3 files changed (+62 −12)

README.md
docs/source/user-guide/common-operations/views.rst
src/dataframe.rs

README.md

Lines changed: 3 additions & 6 deletions

@@ -81,7 +81,7 @@ This produces the following chart:
 
 ## Registering a DataFrame as a View
 
-You can use the `into_view` method to convert a DataFrame into a view and register it with the context.
+You can use SessionContext's `register_view` method to convert a DataFrame into a view and register it with the context.
 
 ```python
 from datafusion import SessionContext, col, literal
@@ -98,11 +98,8 @@ df = ctx.from_pydict(data, "my_table")
 # Filter the DataFrame (for example, keep rows where a > 2)
 df_filtered = df.filter(col("a") > literal(2))
 
-# Convert the filtered DataFrame into a view
-view = df_filtered.into_view()
-
-# Register the view with the context
-ctx.register_table("view1", view)
+# Register the dataframe as a view with the context
+ctx.register_view("view1", df_filtered)
 
 # Now run a SQL query against the registered view
 df_view = ctx.sql("SELECT * FROM view1")
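
Read as a whole, the updated README snippet now registers the filtered DataFrame in a single call. For reference, here is a hedged, self-contained version of the snippet as it reads after this change; the `data` dictionary is an assumption, since the diff context never shows how `my_table` is populated:

```python
from datafusion import SessionContext, col, literal

ctx = SessionContext()

# Assumed sample data; the diff only shows `ctx.from_pydict(data, "my_table")`.
data = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
df = ctx.from_pydict(data, "my_table")

# Filter the DataFrame (for example, keep rows where a > 2)
df_filtered = df.filter(col("a") > literal(2))

# Register the DataFrame as a view with the context (the new one-step API)
ctx.register_view("view1", df_filtered)

# Now run a SQL query against the registered view
df_view = ctx.sql("SELECT * FROM view1")
df_view.show()
```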

docs/source/user-guide/common-operations/views.rst

Lines changed: 3 additions & 6 deletions

@@ -19,7 +19,7 @@
 Registering Views
 ======================
 
-You can use the ``into_view`` method to convert a DataFrame into a view and register it with the context.
+You can use the context's ``register_view`` method to register a DataFrame as a view
 
 .. code-block:: python
 
@@ -37,11 +37,8 @@ You can use the ``into_view`` method to convert a DataFrame into a view and regi
 # Filter the DataFrame (for example, keep rows where a > 2)
 df_filtered = df.filter(col("a") > literal(2))
 
-# Convert the filtered DataFrame into a view
-view = df_filtered.into_view()
-
-# Register the view with the context
-ctx.register_table("view1", view)
+# Register the dataframe as a view with the context
+ctx.register_view("view1", df_filtered)
 
 # Now run a SQL query against the registered view
 df_view = ctx.sql("SELECT * FROM view1")

src/dataframe.rs

Lines changed: 56 additions & 0 deletions

@@ -52,6 +52,9 @@ use crate::{
     expr::{sort_expr::PySortExpr, PyExpr},
 };
 
+// https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
+// - we have not decided on the table_provider approach yet
+// this is an interim implementation
 #[pyclass(name = "TableProvider", module = "datafusion")]
 pub struct PyTableProvider {
     provider: Arc<dyn TableProvider>,
@@ -71,6 +74,57 @@ impl PyTableProvider {
 /// A PyDataFrame is a representation of a logical plan and an API to compose statements.
 /// Use it to build a plan and `.collect()` to execute the plan and collect the result.
 /// The actual execution of a plan runs natively on Rust and Arrow on a multi-threaded environment.
+///
+/// # Methods
+///
+/// - `new`: Creates a new PyDataFrame.
+/// - `__getitem__`: Enable selection for `df[col]`, `df[col1, col2, col3]`, and `df[[col1, col2, col3]]`.
+/// - `__repr__`: Returns a string representation of the DataFrame.
+/// - `_repr_html_`: Returns an HTML representation of the DataFrame.
+/// - `describe`: Calculate summary statistics for a DataFrame.
+/// - `schema`: Returns the schema from the logical plan.
+/// - `into_view`: Convert this DataFrame into a Table that can be used in register_table. We have not finalized on PyTableProvider approach yet.
+/// - `select_columns`: Select columns from the DataFrame.
+/// - `select`: Select expressions from the DataFrame.
+/// - `drop`: Drop columns from the DataFrame.
+/// - `filter`: Filter the DataFrame based on a predicate.
+/// - `with_column`: Add a new column to the DataFrame.
+/// - `with_columns`: Add multiple new columns to the DataFrame.
+/// - `with_column_renamed`: Rename a column in the DataFrame.
+/// - `aggregate`: Aggregate the DataFrame based on group by and aggregation expressions.
+/// - `sort`: Sort the DataFrame based on expressions.
+/// - `limit`: Limit the number of rows in the DataFrame.
+/// - `collect`: Executes the plan, returning a list of `RecordBatch`es.
+/// - `cache`: Cache the DataFrame.
+/// - `collect_partitioned`: Executes the DataFrame and collects all results into a vector of vector of RecordBatch maintaining the input partitioning.
+/// - `show`: Print the result, 20 lines by default.
+/// - `distinct`: Filter out duplicate rows.
+/// - `join`: Join two DataFrames.
+/// - `join_on`: Join two DataFrames based on expressions.
+/// - `explain`: Print the query plan.
+/// - `logical_plan`: Get the logical plan for this DataFrame.
+/// - `optimized_logical_plan`: Get the optimized logical plan for this DataFrame.
+/// - `execution_plan`: Get the execution plan for this DataFrame.
+/// - `repartition`: Repartition the DataFrame based on a logical partitioning scheme.
+/// - `repartition_by_hash`: Repartition the DataFrame based on a hash partitioning scheme.
+/// - `union`: Calculate the union of two DataFrames, preserving duplicate rows.
+/// - `union_distinct`: Calculate the distinct union of two DataFrames.
+/// - `unnest_column`: Unnest a column in the DataFrame.
+/// - `unnest_columns`: Unnest multiple columns in the DataFrame.
+/// - `intersect`: Calculate the intersection of two DataFrames.
+/// - `except_all`: Calculate the exception of two DataFrames.
+/// - `write_csv`: Write the DataFrame to a CSV file.
+/// - `write_parquet`: Write the DataFrame to a Parquet file.
+/// - `write_json`: Write the DataFrame to a JSON file.
+/// - `to_arrow_table`: Convert the DataFrame to an Arrow Table.
+/// - `__arrow_c_stream__`: Convert the DataFrame to an Arrow C Stream.
+/// - `execute_stream`: Execute the DataFrame and return a RecordBatchStream.
+/// - `execute_stream_partitioned`: Execute the DataFrame and return partitioned RecordBatchStreams.
+/// - `to_pandas`: Convert the DataFrame to a Pandas DataFrame.
+/// - `to_pylist`: Convert the DataFrame to a Python list.
+/// - `to_pydict`: Convert the DataFrame to a Python dictionary.
+/// - `to_polars`: Convert the DataFrame to a Polars DataFrame.
+/// - `count`: Execute the DataFrame to get the total number of rows.
 #[pyclass(name = "DataFrame", module = "datafusion", subclass)]
 #[derive(Clone)]
 pub struct PyDataFrame {
@@ -179,6 +233,8 @@ impl PyDataFrame {
     /// Disabling the clippy lint, so we can use &self
     /// because we're working with Python bindings
     /// where objects are shared
+    /// https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
+    /// - we have not decided on the table_provider approach yet
     #[allow(clippy::wrong_self_convention)]
     fn into_view(&self) -> PyDataFusionResult<PyTable> {
         // Call the underlying Rust DataFrame::into_view method.
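
Note that the Rust changes add only comments and doc text: `into_view` stays in place next to the new `register_view`, so both registration paths remain available to Python callers. Below is a minimal sketch of their equivalence, assuming the view names are free (`view_new` and `view_old` are illustrative, and the sample data is assumed):

```python
from datafusion import SessionContext, col, literal

ctx = SessionContext()
df = ctx.from_pydict({"a": [1, 2, 3]}, "t")  # assumed sample data
df_filtered = df.filter(col("a") > literal(1))

# New convenience method documented by this commit:
ctx.register_view("view_new", df_filtered)

# Interim PyTableProvider path described in the comments above:
view = df_filtered.into_view()        # DataFrame -> table usable with register_table
ctx.register_table("view_old", view)

# Both registrations should answer the same query.
assert (ctx.sql("SELECT count(*) FROM view_new").to_pydict()
        == ctx.sql("SELECT count(*) FROM view_old").to_pydict())
```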
