Commit 9abec7e

Fokko and Jayce Slesar authored

Docs: Condense Python API docs (#2139)

I noticed that the docs needed some TLC.

- Collapsed some lines to make the docs more compact.
- Avoid imports where possible (e.g. transforms).
- Update docs.
- Add an example of `to_arrow_batch_reader` earlier in the docs.

Co-authored-by: Jayce Slesar <[email protected]>

1 parent 2127a32 commit 9abec7e

File tree

1 file changed: +47 −94 lines changed


mkdocs/docs/api.md

Lines changed: 47 additions & 94 deletions
@@ -24,7 +24,7 @@ hide:

# Python API

-PyIceberg is based around catalogs to load tables. First step is to instantiate a catalog that loads tables. Let's use the following configuration to define a catalog called `prod`:
+(Py)Iceberg is [catalog](https://iceberg.apache.org/terms/#catalog) centric, meaning that reading and writing data goes through a catalog. The first step is to instantiate a catalog to load a table. Let's use the following configuration in `.pyiceberg.yaml` to define a REST catalog called `prod`:

```yaml
catalog:
@@ -33,7 +33,7 @@ catalog:
    credential: t-1234:secret
```

-Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`:
+Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`, for example, in the case of a Hive and REST catalog:

```yaml
catalog:
@@ -47,13 +47,11 @@ catalog:
    warehouse: my-warehouse
```

-and loaded in python by calling `load_catalog(name="hive")` and `load_catalog(name="rest")`.
+The different catalogs can be loaded in PyIceberg by their name: `load_catalog(name="hive")` and `load_catalog(name="rest")`. An overview of the configuration options can be found on the [configuration page](https://py.iceberg.apache.org/configuration/).

This information must be placed inside a file called `.pyiceberg.yaml` located either in the `$HOME` or `%USERPROFILE%` directory (depending on whether the operating system is Unix-based or Windows-based, respectively), in the current working directory, or in the `$PYICEBERG_HOME` directory (if the corresponding environment variable is set).

-For more details on possible configurations refer to the [specific page](https://py.iceberg.apache.org/configuration/).
-
-Then load the `prod` catalog:
+It is also possible to load a catalog without using a `.pyiceberg.yaml` by passing in the properties directly:

```python
from pyiceberg.catalog import load_catalog
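For reference, loading one of the named catalogs configured above is a one-liner; a minimal sketch, assuming the `.pyiceberg.yaml` shown earlier is in place:

```python
from pyiceberg.catalog import load_catalog

# Loads the "prod" REST catalog defined in .pyiceberg.yaml
catalog = load_catalog(name="prod")
```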
@@ -70,26 +68,20 @@ catalog = load_catalog(
)
```

-Let's create a namespace:
+Next, create a namespace:

```python
catalog.create_namespace("docs_example")
```

-And then list them:
+Or, list existing namespaces:

```python
ns = catalog.list_namespaces()

assert ns == [("docs_example",)]
```

-And then list tables in the namespace:
-
-```python
-catalog.list_tables("docs_example")
-```
-
## Create a table

To create a table from a catalog:
@@ -123,24 +115,21 @@ schema = Schema(
)

from pyiceberg.partitioning import PartitionSpec, PartitionField
-from pyiceberg.transforms import DayTransform

partition_spec = PartitionSpec(
    PartitionField(
-        source_id=1, field_id=1000, transform=DayTransform(), name="datetime_day"
+        source_id=1, field_id=1000, transform="day", name="datetime_day"
    )
)

from pyiceberg.table.sorting import SortOrder, SortField
-from pyiceberg.transforms import IdentityTransform

# Sort on the symbol
-sort_order = SortOrder(SortField(source_id=2, transform=IdentityTransform()))
+sort_order = SortOrder(SortField(source_id=2, transform='identity'))

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
-    location="s3://pyiceberg",
    partition_spec=partition_spec,
    sort_order=sort_order,
)
@@ -153,32 +142,24 @@ To create a table using a pyarrow schema:
```python
import pyarrow as pa

-schema = pa.schema(
-    [
+schema = pa.schema([
    pa.field("foo", pa.string(), nullable=True),
    pa.field("bar", pa.int32(), nullable=False),
    pa.field("baz", pa.bool_(), nullable=True),
-    ]
-)
+])

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
)
```

-To create a table with some subsequent changes atomically in a transaction:
+Another API to create a table is `create_table_transaction`. This follows the same APIs used when making updates to a table, and it is a friendly way to set both the partition specification and the sort order, because you don't have to deal with field-IDs.

```python
-with catalog.create_table_transaction(
-    identifier="docs_example.bids",
-    schema=schema,
-    location="s3://pyiceberg",
-    partition_spec=partition_spec,
-    sort_order=sort_order,
-) as txn:
+with catalog.create_table_transaction(identifier="docs_example.bids", schema=schema) as txn:
    with txn.update_schema() as update_schema:
-        update_schema.add_column(path="new_column", field_type=StringType())
+        update_schema.add_column(path="new_column", field_type='string')

    with txn.update_spec() as update_spec:
        update_spec.add_identity("symbol")
@@ -188,6 +169,8 @@ with catalog.create_table_transaction(

## Load a table

+There are two ways of reading an Iceberg table: through a catalog, or by pointing at the Iceberg metadata directly. Reading through a catalog is preferred; pointing directly at the metadata is read-only.
+
### Catalog table

Loading the `bids` table:
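To make the catalog route concrete, a minimal sketch, assuming the `prod` catalog and the `docs_example.bids` table created earlier:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("prod")

# Returns a Table that can be queried and altered
tbl = catalog.load_table("docs_example.bids")
```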
@@ -203,7 +186,7 @@ This returns a `Table` that represents an Iceberg table that can be queried and

### Static table

-To load a table directly from a metadata file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:
+To load a table directly from a `metadata.json` file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:

```python
from pyiceberg.table import StaticTable
@@ -213,16 +196,13 @@ static_table = StaticTable.from_metadata(
)
```

-The static-table is considered read-only.
-
-Alternatively, if your table metadata directory contains a `version-hint.text` file, you can just specify
-the table root path, and the latest metadata file will be picked automatically.
+The static-table does not allow for write operations. If your table metadata directory contains a `version-hint.text` file, you can just specify the table root path, and the latest `metadata.json` file will be resolved automatically:

```python
from pyiceberg.table import StaticTable

static_table = StaticTable.from_metadata(
-    "s3://warehouse/wh/nyc.db/taxis
+    "s3://warehouse/wh/nyc.db/taxis"
)
```
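Since a static table is read-only, it is typically consumed through a scan; a minimal sketch, assuming the `taxis` table root above is reachable:

```python
from pyiceberg.table import StaticTable

# version-hint.text under the table root resolves the latest metadata.json
static_table = StaticTable.from_metadata("s3://warehouse/wh/nyc.db/taxis")

# Reads work as usual; write operations are not supported on a StaticTable
df = static_table.scan().to_arrow()
```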

@@ -236,9 +216,9 @@ catalog.table_exists("docs_example.bids")

Returns `True` if the table already exists.

-## Write support
+## Write to a table

-With PyIceberg 0.6.0 write support is added through Arrow. Let's consider an Arrow Table:
+Reading and writing is done using [Apache Arrow](https://arrow.apache.org/). Arrow is an in-memory columnar format for fast data interchange and in-memory analytics. Let's consider the following Arrow Table:

```python
import pyarrow as pa
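As an aside, the `table_exists` check combines naturally with `create_table` and `load_table`; a minimal create-if-missing sketch, assuming the catalog and schema defined earlier:

```python
identifier = "docs_example.bids"

# Create the table only when it is not already registered in the catalog
if not catalog.table_exists(identifier):
    tbl = catalog.create_table(identifier=identifier, schema=schema)
else:
    tbl = catalog.load_table(identifier)
```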
@@ -253,31 +233,22 @@ df = pa.Table.from_pylist(
)
```

-Next, create a table based on the schema:
+Next, create a table using the Arrow schema:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
-
-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-tbl = catalog.create_table("default.cities", schema=schema)
+tbl = catalog.create_table("default.cities", schema=df.schema)
```

-Now write the data to the table:
+Next, write the data to the table. Both `append` and `overwrite` produce the same result, since the table is empty on creation:

<!-- prettier-ignore-start -->

!!! note inline end "Fast append"
-    PyIceberg default to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables quick writes, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a normal commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.
+    PyIceberg defaults to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables fast commit operations, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a merge commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.

<!-- prettier-ignore-end -->

@@ -289,7 +260,7 @@ tbl.append(df)
tbl.overwrite(df)
```

-The data is written to the table, and when the table is read using `tbl.scan().to_arrow()`:
+Now, the data is written to the table, and the table can be read using `tbl.scan().to_arrow()`:

```python
pyarrow.Table
@@ -302,14 +273,12 @@ lat: [[52.371807,37.773972,53.11254,48.864716]]
long: [[4.896029,-122.431297,6.0989,2.349014]]
```

-You both can use `append(df)` or `overwrite(df)` since there is no data yet. If we want to add more data, we can use `.append()` again:
+If we want to add more data, we can use `.append()` again:

```python
-df = pa.Table.from_pylist(
+tbl.append(pa.Table.from_pylist(
    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}],
-)
-
-tbl.append(df)
+))
```

When reading the table `tbl.scan().to_arrow()` you can see that `Groningen` is now also part of the table:
@@ -325,33 +294,30 @@ lat: [[52.371807,37.773972,53.11254,48.864716],[53.21917]]
long: [[4.896029,-122.431297,6.0989,2.349014],[6.56667]]
```

-The nested lists indicate the different Arrow buffers, where the first write results into a buffer, and the second append in a separate buffer. This is expected since it will read two parquet files.
-
-To avoid any type errors during writing, you can enforce the PyArrow table types using the Iceberg table schema:
+The nested lists indicate the different Arrow buffers. Each of the writes produces a [Parquet file](https://parquet.apache.org/) where each [row group](https://parquet.apache.org/docs/concepts/) translates into an Arrow buffer. In the case where the table is large, PyIceberg also allows the option to stream the buffers using the Arrow [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), avoiding pulling everything into memory right away:

```python
-from pyiceberg.catalog import load_catalog
-import pyarrow as pa
+for buf in tbl.scan().to_arrow_batch_reader():
+    print(f"Buffer contains {len(buf)} rows")
+```

-catalog = load_catalog("default")
-table = catalog.load_table("default.cities")
-schema = table.schema().as_arrow()
+To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:

+```python
df = pa.Table.from_pylist(
-    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}], schema=schema
+    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}], schema=tbl.schema().as_arrow()
)

-table.append(df)
+tbl.append(df)
```

-You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`.
+You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`. This will use the Iceberg metadata to only open up the Parquet files that contain relevant information.

```python
tbl.delete(delete_filter="city == 'Paris'")
```

-In the above example, any records where the city field value equals to `Paris` will be deleted.
-Running `tbl.scan().to_arrow()` will now yield:
+In the above example, any records where the city field value equals `Paris` will be deleted. Running `tbl.scan().to_arrow()` will now yield:

```python
pyarrow.Table
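The `delete_filter` predicate string above uses the same expression syntax as reads; a minimal sketch of a filtered scan, assuming `scan` accepts a `row_filter` with that syntax:

```python
# Only the files that can contain matching rows are opened, thanks to Iceberg metadata
paris_only = tbl.scan(row_filter="city == 'Paris'").to_arrow()
```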
@@ -364,30 +330,11 @@ lat: [[52.371807,37.773972,53.11254],[53.21917]]
long: [[4.896029,-122.431297,6.0989],[6.56667]]
```

-### Partial overwrites
-
-When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table.
-
-For example, with an iceberg table created as:
-
-```python
-from pyiceberg.catalog import load_catalog
-
-catalog = load_catalog("default")
-
-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
+In the case of `tbl.delete(delete_filter="city == 'Groningen'")`, the whole Parquet file will be dropped without checking its contents, since from the Iceberg metadata PyIceberg can derive that all the content in the file matches the predicate.

-schema = Schema(
-    NestedField(1, "city", StringType(), required=False),
-    NestedField(2, "lat", DoubleType(), required=False),
-    NestedField(3, "long", DoubleType(), required=False),
-)
-
-tbl = catalog.create_table("default.cities", schema=schema)
-```
+### Partial overwrites

-And with initial data populating the table:
+When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table. For example, consider the following Iceberg table:

```python
import pyarrow as pa
@@ -399,6 +346,12 @@ df = pa.Table.from_pylist(
        {"city": "Paris", "lat": 48.864716, "long": 2.349014},
    ],
)
+
+from pyiceberg.catalog import load_catalog
+catalog = load_catalog("default")
+
+tbl = catalog.create_table("default.cities", schema=df.schema)
+
tbl.append(df)
```
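To round off the partial-overwrite example, a minimal sketch of `overwrite` with an `overwrite_filter`; the replacement row is illustrative, and the filter is assumed to accept the same predicate string syntax as `delete_filter`:

```python
# Replace only the Paris row: rows matching the filter are deleted,
# then the new data is appended in a single operation
df_updated = pa.Table.from_pylist(
    [{"city": "Paris", "lat": 48.864716, "long": 2.349014}],
    schema=tbl.schema().as_arrow(),
)

tbl.overwrite(df_updated, overwrite_filter="city == 'Paris'")
```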

0 commit comments
