I noticed that the docs needed some TLC.
- Collapsed some lines to make the docs more compact.
- Avoided imports where possible (e.g., transforms).
- Updated the docs.
- Added an example of `to_arrow_batch_reader` earlier in the docs.
# Rationale for this change
# Are these changes tested?
# Are there any user-facing changes?
---------
Co-authored-by: Jayce Slesar <[email protected]>
mkdocs/docs/api.md: 47 additions & 94 deletions
@@ -24,7 +24,7 @@ hide:
# Python API

-PyIceberg is based around catalogs to load tables. First step is to instantiate a catalog that loads tables. Let's use the following configuration to define a catalog called `prod`:
+(Py)Iceberg is [catalog](https://iceberg.apache.org/terms/#catalog) centric, meaning that reading and writing data goes through a catalog. The first step is to instantiate a catalog to load a table. Let's use the following configuration in `.pyiceberg.yaml` to define a REST catalog called `prod`:

```yaml
catalog:
@@ -33,7 +33,7 @@ catalog:
    credential: t-1234:secret
```

-Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`:
+Note that multiple catalogs can be defined in the same `.pyiceberg.yaml`, for example, in the case of a Hive and REST catalog:

```yaml
catalog:
@@ -47,13 +47,11 @@ catalog:
    warehouse: my-warehouse
```

-and loaded in python by calling `load_catalog(name="hive")` and `load_catalog(name="rest")`.
+The different catalogs can be loaded in PyIceberg by their name: `load_catalog(name="hive")` and `load_catalog(name="rest")`. An overview of the configuration options can be found on the [configuration page](https://py.iceberg.apache.org/configuration/).

This information must be placed inside a file called `.pyiceberg.yaml` located either in the `$HOME` or `%USERPROFILE%` directory (depending on whether the operating system is Unix-based or Windows-based, respectively), in the current working directory, or in the `$PYICEBERG_HOME` directory (if the corresponding environment variable is set).

-For more details on possible configurations refer to the [specific page](https://py.iceberg.apache.org/configuration/).
-
-Then load the `prod` catalog:
+It is also possible to load a catalog without using a `.pyiceberg.yaml` by passing in the properties directly:

```python
from pyiceberg.catalog import load_catalog
@@ -70,26 +68,20 @@ catalog = load_catalog(
)
```

-Let's create a namespace:
+Next, create a namespace:

```python
catalog.create_namespace("docs_example")
```

-And then list them:
+Or, list existing namespaces:

```python
ns = catalog.list_namespaces()

assert ns == [("docs_example",)]
```

-And then list tables in the namespace:
-
-```python
-catalog.list_tables("docs_example")
-```
-
## Create a table

To create a table from a catalog:
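The schema definition that follows this sentence in the docs is largely cut off by the diff. Purely as an illustrative sketch (the `docs_example.cities` identifier and its columns are hypothetical, not part of this change), creating a table with an explicit Iceberg schema could look like:

```python
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, NestedField, StringType

# Field IDs are assigned explicitly when building an Iceberg schema by hand
schema = Schema(
    NestedField(field_id=1, name="city", field_type=StringType(), required=False),
    NestedField(field_id=2, name="lat", field_type=DoubleType(), required=False),
    NestedField(field_id=3, name="long", field_type=DoubleType(), required=False),
)

catalog.create_table(identifier="docs_example.cities", schema=schema)
```

With the Arrow-schema and transaction APIs shown further down, PyIceberg assigns the field IDs for you.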
@@ -123,24 +115,21 @@ schema = Schema(
)

from pyiceberg.partitioning import PartitionSpec, PartitionField
@@ -153,32 +142,24 @@ To create a table using a pyarrow schema:
```python
import pyarrow as pa

-schema = pa.schema(
-    [
+schema = pa.schema([
    pa.field("foo", pa.string(), nullable=True),
    pa.field("bar", pa.int32(), nullable=False),
    pa.field("baz", pa.bool_(), nullable=True),
-    ]
-)
+])

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
)
```

-To create a table with some subsequent changes atomically in a transaction:
+Another API for creating a table is `create_table_transaction`. It follows the same APIs used when making updates to a table, and it is a friendly API for setting both the partition specification and the sort order, because you don't have to deal with field IDs.

```python
-with catalog.create_table_transaction(
-    identifier="docs_example.bids",
-    schema=schema,
-    location="s3://pyiceberg",
-    partition_spec=partition_spec,
-    sort_order=sort_order,
-) as txn:
+with catalog.create_table_transaction(identifier="docs_example.bids", schema=schema) as txn:
@@ -188,6 +169,8 @@ with catalog.create_table_transaction(

## Load a table

+There are two ways of reading an Iceberg table: through a catalog, or by pointing at the Iceberg metadata directly. Reading through a catalog is preferred; directly pointing at the metadata is read-only.
+
### Catalog table

Loading the `bids` table:
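The snippet that follows this sentence in the docs is not shown in the diff. As an illustrative sketch only, loading a table through the catalog is a short call (the `default` catalog name and `docs_example.bids` identifier are taken from the surrounding examples):

```python
from pyiceberg.catalog import load_catalog

# Load the catalog defined in .pyiceberg.yaml, then look up the table by identifier
catalog = load_catalog("default")
tbl = catalog.load_table("docs_example.bids")
```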
@@ -203,7 +186,7 @@ This returns a `Table` that represents an Iceberg table that can be queried and

### Static table

-To load a table directly from a metadata file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:
+To load a table directly from a `metadata.json` file (i.e., **without** using a catalog), you can use a `StaticTable` as follows:

-Alternatively, if your table metadata directory contains a `version-hint.text` file, you can just specify
-the table root path, and the latest metadata file will be picked automatically.
+The static table does not allow for write operations. If your table metadata directory contains a `version-hint.text` file, you can just specify the table root path, and the latest `metadata.json` file will be resolved automatically:
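The accompanying snippet is cut off by the diff. Purely as an illustrative sketch (the warehouse path and file name are hypothetical), loading a static table might look like:

```python
from pyiceberg.table import StaticTable

# Point directly at a metadata.json file; the resulting table is read-only
static_table = StaticTable.from_metadata(
    "s3://warehouse/docs_example/bids/metadata/00001-0000aaaa.metadata.json"
)

print(static_table.scan().to_arrow())
```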
-With PyIceberg 0.6.0 write support is added through Arrow. Let's consider an Arrow Table:
+Reading and writing is done using [Apache Arrow](https://arrow.apache.org/). Arrow is an in-memory columnar format for fast data interchange and in-memory analytics. Let's consider the following Arrow Table:

```python
import pyarrow as pa
@@ -253,31 +233,22 @@ df = pa.Table.from_pylist(
)
```

-Next, create a table based on the schema:
+Next, create a table using the Arrow schema:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")

-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
Next, write the data to the table. Both `append` and `overwrite` produce the same result, since the table is empty on creation:

<!-- prettier-ignore-start -->

!!! note inline end "Fast append"
-    PyIceberg default to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables quick writes, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a normal commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.
+    PyIceberg defaults to the [fast append](https://iceberg.apache.org/spec/#snapshots) to minimize the amount of data written. This enables fast commit operations, reducing the possibility of conflicts. The downside of the fast append is that it creates more metadata than a merge commit. [Compaction is planned](https://github.com/apache/iceberg-python/issues/270) and will automatically rewrite all the metadata when a threshold is hit, to maintain performant reads.

<!-- prettier-ignore-end -->
@@ -289,7 +260,7 @@ tbl.append(df)
tbl.overwrite(df)
```

-The data is written to the table, and when the table is read using `tbl.scan().to_arrow()`:
+Now, the data is written to the table, and the table can be read using `tbl.scan().to_arrow()`:
-The nested lists indicate the different Arrow buffers, where the first write results into a buffer, and the second append in a separate buffer. This is expected since it will read two parquet files.
-
-To avoid any type errors during writing, you can enforce the PyArrow table types using the Iceberg table schema:
+The nested lists indicate the different Arrow buffers. Each of the writes produces a [Parquet file](https://parquet.apache.org/), where each [row group](https://parquet.apache.org/docs/concepts/) translates into an Arrow buffer. In the case where the table is large, PyIceberg also allows the option to stream the buffers using the Arrow [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), avoiding pulling everything into memory right away:

```python
-from pyiceberg.catalog import load_catalog
-import pyarrow as pa
+for buf in tbl.scan().to_arrow_batch_reader():
+    print(f"Buffer contains {len(buf)} rows")
+```

-catalog = load_catalog("default")
-table = catalog.load_table("default.cities")
-schema = table.schema().as_arrow()
+To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
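The code block that follows this sentence in the docs is cut off by the diff. As an illustrative sketch only, converting the Iceberg schema to Arrow and using it when building the PyArrow table could look like the following (the `default.cities` table comes from the removed lines above; the `lat`/`long` columns are assumed):

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
tbl = catalog.load_table("default.cities")

# Build the Arrow table against the Arrow-converted Iceberg schema,
# so field types and nullability line up with the table definition
df = pa.Table.from_pylist(
    [{"city": "Groningen", "lat": 53.21917, "long": 6.56667}],
    schema=tbl.schema().as_arrow(),
)

tbl.append(df)
```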
-You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`.
+You can delete some of the data from the table by calling `tbl.delete()` with a desired `delete_filter`. This will use the Iceberg metadata to only open up the Parquet files that contain relevant information.

```python
tbl.delete(delete_filter="city == 'Paris'")
```

-In the above example, any records where the city field value equals to `Paris` will be deleted.
-Running `tbl.scan().to_arrow()` will now yield:
+In the above example, any records where the city field equals `Paris` will be deleted. Running `tbl.scan().to_arrow()` will now yield:
-When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table.
-
-For example, with an iceberg table created as:
-
-```python
-from pyiceberg.catalog import load_catalog
-
-catalog = load_catalog("default")
-
-from pyiceberg.schema import Schema
-from pyiceberg.types import NestedField, StringType, DoubleType
+In the case of `tbl.delete(delete_filter="city == 'Groningen'")`, the whole Parquet file will be dropped without checking its contents, since PyIceberg can derive from the Iceberg metadata that all the content in the file matches the predicate.

+When using the `overwrite` API, you can use an `overwrite_filter` to delete data that matches the filter before appending new data into the table. For example, consider the following Iceberg table:
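The table-creation code referenced above is cut off by the diff. As an illustrative sketch only (assuming `tbl` is the loaded table and `df` holds the replacement rows), an overwrite with a filter could look like:

```python
from pyiceberg.expressions import EqualTo

# Rows matching the filter are deleted first, then the new data is appended,
# all within a single commit
tbl.overwrite(df, overwrite_filter=EqualTo("city", "Paris"))
```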