Commit 9abec7e

Fokko and Jayce Slesar authored

Docs: Condense Python API docs (#2139)

I noticed that the docs needed some TLC.

- Collapsed some lines to make the docs more compact.
- Avoid imports where possible (e.g. transforms).
- Update docs.
- Add an example of `to_arrow_batch_reader` earlier in the docs.

Co-authored-by: Jayce Slesar <[email protected]>

1 parent 2127a32 commit 9abec7e

File tree

1 file changed: +47 −94 lines changed


mkdocs/docs/api.md

Lines changed: 47 additions & 94 deletions
@@ -24,7 +24,7 @@ hide:

# Python API

-PyIceberg is based around catalogs to load tables. First step is to instantiate a catalog that loads tables. Let's use the following configuration to define a catalog called `prod`:
+(Py)Iceberg is [catalog](https://iceberg.apache.org/terms/#catalog) centric, meaning that reading and writing data goes through a catalog. The first step is to instantiate a catalog to load a table. Let's use the following configuration in `.pyiceberg.yaml` to define a REST catalog called `prod`:

```yaml
catalog:
@@ -33,7 +33,7 @@ catalog:
    credential: t-1234:secret
```

-Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`:
+Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`, for example, in the case of a Hive and REST catalog:

```yaml
catalog:
@@ -47,13 +47,11 @@ catalog:
    warehouse: my-warehouse
```

-and loaded in python by calling `load_catalog(name="hive")` and `load_catalog(name="rest")`.
+The different catalogs can be loaded in PyIceberg by their name: `load_catalog(name="hive")` and `load_catalog(name="rest")`. An overview of the configuration options can be found on the [configuration page](https://py.iceberg.apache.org/configuration/).

This information must be placed inside a file called `.pyiceberg.yaml` located either in the `$HOME` or `%USERPROFILE%` directory (depending on whether the operating system is Unix-based or Windows-based, respectively), in the current working directory, or in the `$PYICEBERG_HOME` directory (if the corresponding environment variable is set).

-For more details on possible configurations refer to the [specific page](https://py.iceberg.apache.org/configuration/).
-
-Then load the `prod` catalog:
+It is also possible to load a catalog without using a `.pyiceberg.yaml` by passing in the properties directly:

```python
from pyiceberg.catalog import load_catalog
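For reference, loading one of the named catalogs configured above is a one-liner; a minimal sketch, assuming the `.pyiceberg.yaml` shown earlier is in place:

```python
from pyiceberg.catalog import load_catalog

# Loads the "prod" REST catalog defined in .pyiceberg.yaml
catalog = load_catalog(name="prod")
```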
@@ -70,26 +68,20 @@ catalog = load_catalog(
)
```

-Let's create a namespace:
+Next, create a namespace:

```python
catalog.create_namespace("docs_example")
```

-And then list them:
+Or, list existing namespaces:

```python
ns = catalog.list_namespaces()

assert ns == [("docs_example",)]
```

-And then list tables in the namespace:
-
-```python
-catalog.list_tables("docs_example")
-```
-
## Create a table

To create a table from a catalog:
@@ -123,24 +115,21 @@ schema = Schema(
)

from pyiceberg.partitioning import PartitionSpec, PartitionField
-from pyiceberg.transforms import DayTransform

partition_spec = PartitionSpec(
    PartitionField(
-        source_id=1, field_id=1000, transform=DayTransform(), name="datetime_day"
+        source_id=1, field_id=1000, transform="day", name="datetime_day"
    )
)

from pyiceberg.table.sorting import SortOrder, SortField
-from pyiceberg.transforms import IdentityTransform

# Sort on the symbol
-sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
+sort_order = SortOrder(SortField(source_id=2, transform='identity'))

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
-    location="s3://pyiceberg",
    partition_spec=partition_spec,
    sort_order=sort_order,
)
@@ -153,32 +142,24 @@ To create a table using a pyarrow schema:
```python
import pyarrow as pa

-schema = pa.schema(
-    [
+schema = pa.schema([
    pa.field("foo", pa.string(), nullable=True),
    pa.field("bar", pa.int32(), nullable=False),
    pa.field("baz", pa.bool_(), nullable=True),
-    ]
-)
+])

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
)
```

-To create a table with some subsequent changes atomically in a transaction:
+Another API to create a table is `create_table_transaction`. This follows the same APIs used when making updates to a table, and it is a friendly way to set both the partition specification and the sort order, because you don't have to deal with field-IDs.

```python
-with catalog.create_table_transaction(
-    identifier="docs_example.bids",
-    schema=schema,
-    location="s3://pyiceberg",
-    partition_spec=partition_spec,
-    sort_order=sort_order,
-) as txn:
+with catalog.create_table_transaction(identifier="docs_example.bids", schema=schema) as txn:
    with txn.update_schema() as update_schema:
-        update_schema.add_column(path="new_column", field_type=StringType())
+        update_schema.add_column(path="new_column", field_type='string')

    with txn.update_spec() as update_spec:
        update_spec.add_identity("symbol")
@@ -188,6 +169,8 @@ with catalog.create_table_transaction(

## Load a table

+There are two ways of reading an Iceberg table: through a catalog, or by pointing at the Iceberg metadata directly. Reading through a catalog is preferred; pointing directly at the metadata is read-only.
+
### Catalog table

Loading the `bids` table:
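To make the catalog route concrete, a minimal sketch, assuming the `prod` catalog and the `docs_example.bids` table created earlier:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("prod")

# Returns a Table that can be queried and altered
tbl = catalog.load_table("docs_example.bids")
```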
@@ -203,7 +186,7 @@ This returns a `Table` that represents an Iceberg table that can be queried and

### Static table

-To load a table directly from a metadata file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:
+To load a table directly from a `metadata.json` file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:

```python
from pyiceberg.table import StaticTable
@@ -213,16 +196,13 @@ static_table = StaticTable.from_metadata(
)
```

-The static-table is considered read-only.
-
-Alternatively, if your table metadata directory contains a `version-hint.text` file, you can just specify
-the table root path, and the latest metadata file will be picked automatically.
+The static-table does not allow for write operations. If your table metadata directory contains a `version-hint.text` file, you can just specify the table root path, and the latest `metadata.json` file will be resolved automatically:

```python
from pyiceberg.table import StaticTable

static_table = StaticTable.from_metadata(
-    "s3://warehouse/wh/nyc.db/taxis
+    "s3://warehouse/wh/nyc.db/taxis"
)
```
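Since a static table is read-only, it is typically consumed through a scan; a minimal sketch, assuming the `taxis` table root above is reachable:

```python
from pyiceberg.table import StaticTable

# version-hint.text under the table root resolves the latest metadata.json
static_table = StaticTable.from_metadata("s3://warehouse/wh/nyc.db/taxis")

# Reads work as usual; write operations are not supported on a StaticTable
df = static_table.scan().to_arrow()
```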

@@ -236,9 +216,9 @@ catalog.table_exists("docs_example.bids")

Returns `True` if the table already exists.

-## Write support
+## Write to a table

-With PyIceberg 0.6.0 write support is added through Arrow. Let's consider an Arrow Table:
+Reading and writing is done using [Apache Arrow](https://arrow.apache.org/). Arrow is an in-memory columnar format for fast data interchange and in-memory analytics. Let's consider the following Arrow Table:

```python
import pyarrow as pa
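As an aside, the `table_exists` check combines naturally with `create_table` and `load_table`; a minimal create-if-missing sketch, assuming the catalog and schema defined earlier:

```python
identifier = "docs_example.bids"

# Create the table only when it is not already registered in the catalog
if not catalog.table_exists(identifier):
    tbl = catalog.create_table(identifier=identifier, schema=schema)
else:
    tbl = catalog.load_table(identifier)
```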
@@ -253,31 +233,22 @@ df = pa.Table.from_pylist(
)
```

-Next, create a table based on the schema:
+Next, create a table using the Arrow schema:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
-
-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-tbl = catalog.create_table("default.cities", schema=schema)
+tbl = catalog.create_table("default.cities", schema=df.schema)
```

-Now write the data to the table:
+Next, write the data to the table. Both `append` and `overwrite` produce the same result, since the table is empty on creation:

<!-- prettier-ignore-start -->

!!! note inline end "Fast append"
-    PyIceberg default to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables quick writes, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a normal commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.
+    PyIceberg defaults to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables fast commit operations, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a merge commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.

<!-- prettier-ignore-end -->

@@ -289,7 +260,7 @@ tbl.append(df)
tbl.overwrite(df)
```

-The data is written to the table, and when the table is read using `tbl.scan().to_arrow()`:
+Now, the data is written to the table, and the table can be read using `tbl.scan().to_arrow()`:

```python
pyarrow.Table
@@ -302,14 +273,12 @@ lat: [[52.371807,37.773972,53.11254,48.864716]]
long: [[4.896029,-122.431297,6.0989,2.349014]]
```

-You both can use `append(df)` or `overwrite(df)` since there is no data yet. If we want to add more data, we can use `.append()` again:
+If we want to add more data, we can use `.append()` again:

```python
-df = pa.Table.from_pylist(
+tbl.append(pa.Table.from_pylist(
    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}],
-)
-
-tbl.append(df)
+))
```

When reading the table `tbl.scan().to_arrow()` you can see that `Groningen` is now also part of the table:
@@ -325,33 +294,30 @@ lat: [[52.371807,37.773972,53.11254,48.864716],[53.21917]]
long: [[4.896029,-122.431297,6.0989,2.349014],[6.56667]]
```

-The nested lists indicate the different Arrow buffers, where the first write results into a buffer, and the second append in a separate buffer. This is expected since it will read two parquet files.
-
-To avoid any type errors during writing, you can enforce the PyArrow table types using the Iceberg table schema:
+The nested lists indicate the different Arrow buffers. Each of the writes produces a [Parquet file](https://parquet.apache.org/) where each [row group](https://parquet.apache.org/docs/concepts/) translates into an Arrow buffer. In the case where the table is large, PyIceberg also allows the option to stream the buffers using the Arrow [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), avoiding pulling everything into memory right away:

```python
-from pyiceberg.catalog import load_catalog
-import pyarrow as pa
+for buf in tbl.scan().to_arrow_batch_reader():
+    print(f"Buffer contains {len(buf)} rows")
+```

-catalog = load_catalog("default")
-table = catalog.load_table("default.cities")
-schema = table.schema().as_arrow()
+To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:

+```python
df = pa.Table.from_pylist(
-    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}], schema=schema
+    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}], schema=tbl.schema().as_arrow()
)

-table.append(df)
+tbl.append(df)
```

-You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`.
+You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`. This will use the Iceberg metadata to only open up the Parquet files that contain relevant information.

```python
tbl.delete(delete_filter="city == 'Paris'")
```

-In the above example, any records where the city field value equals to `Paris` will be deleted.
-Running `tbl.scan().to_arrow()` will now yield:
+In the above example, any records where the city field value equals `Paris` will be deleted. Running `tbl.scan().to_arrow()` will now yield:

```python
pyarrow.Table
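The `delete_filter` predicate string above uses the same expression syntax as reads; a minimal sketch of a filtered scan, assuming `scan` accepts a `row_filter` with that syntax:

```python
# Only the files that can contain matching rows are opened, thanks to Iceberg metadata
paris_only = tbl.scan(row_filter="city == 'Paris'").to_arrow()
```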
@@ -364,30 +330,11 @@ lat: [[52.371807,37.773972,53.11254],[53.21917]]
long: [[4.896029,-122.431297,6.0989],[6.56667]]
```

-### Partial overwrites
-
-When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table.
-
-For example, with an iceberg table created as:
-
-```python
-from pyiceberg.catalog import load_catalog
-
-catalog = load_catalog("default")
-
-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
+In the case of `tbl.delete(delete_filter="city == 'Groningen'")`, the whole Parquet file will be dropped without checking its contents, since from the Iceberg metadata PyIceberg can derive that all the content in the file matches the predicate.

-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-tbl = catalog.create_table("default.cities", schema=schema)
-```
+### Partial overwrites

-And with initial data populating the table:
+When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table. For example, consider the following Iceberg table:

```python
import pyarrow as pa
@@ -399,6 +346,12 @@ df = pa.Table.from_pylist(
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)
+
+from pyiceberg.catalog import load_catalog
+catalog = load_catalog("default")
+
+tbl = catalog.create_table("default.cities", schema=df.schema)
+
tbl.append(df)
```
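To round off the partial-overwrite example, a minimal sketch of `overwrite` with an `overwrite_filter`; the replacement row is illustrative, and the filter is assumed to accept the same predicate string syntax as `delete_filter`:

```python
# Replace only the Paris row: rows matching the filter are deleted,
# then the new data is appended in a single operation
df_updated = pa.Table.from_pylist(
    [{"city": "Paris", "lat": 48.864716, "long": 2.349014}],
    schema=tbl.schema().as_arrow(),
)

tbl.overwrite(df_updated, overwrite_filter="city == 'Paris'")
```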

0 commit comments
