
Commit 93a79b9

Merge branch 'main' into refactor/consolidate-snapshot-expiration

2 parents 2e7e4cb + dc43940

25 files changed: +621 -257 lines

.github/workflows/pypi-build-artifacts.yml

Lines changed: 1 addition & 1 deletion

@@ -62,7 +62,7 @@ jobs:
       if: startsWith(matrix.os, 'ubuntu')

     - name: Build wheels
-      uses: pypa/[email protected].0
+      uses: pypa/[email protected].1
       with:
         output-dir: wheelhouse
         config-file: "pyproject.toml"

.github/workflows/svn-build-artifacts.yml

Lines changed: 1 addition & 1 deletion

@@ -57,7 +57,7 @@ jobs:
       if: startsWith(matrix.os, 'ubuntu')

     - name: Build wheels
-      uses: pypa/[email protected].0
+      uses: pypa/[email protected].1
       with:
         output-dir: wheelhouse
         config-file: "pyproject.toml"

dev/Dockerfile

Lines changed: 2 additions & 2 deletions

@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-FROM python:3.9-bullseye
+FROM python:3.12-bullseye

 RUN apt-get -qq update && \
     apt-get -qq install -y --no-install-recommends \
@@ -63,7 +63,7 @@ RUN chmod u+x /opt/spark/sbin/* && \

 RUN pip3 install -q ipython

-RUN pip3 install "pyiceberg[s3fs,hive]==${PYICEBERG_VERSION}"
+RUN pip3 install "pyiceberg[s3fs,hive,pyarrow]==${PYICEBERG_VERSION}"

 COPY entrypoint.sh .
 COPY provision.py .

mkdocs/docs/api.md

Lines changed: 100 additions & 1 deletion

@@ -1106,7 +1106,106 @@ maintenance.expire_snapshots_with_retention_policy(
)
```

-#### Example: Combined policy
Using [Ray Dataset API](https://docs.ray.io/en/latest/data/api/dataset.html) to interact with the dataset:

```python
print(ray_dataset.take(2))
[
    {
        "VendorID": 2,
        "tpep_pickup_datetime": datetime.datetime(2008, 12, 31, 23, 23, 50),
        "tpep_dropoff_datetime": datetime.datetime(2009, 1, 1, 0, 34, 31),
    },
    {
        "VendorID": 2,
        "tpep_pickup_datetime": datetime.datetime(2008, 12, 31, 23, 5, 3),
        "tpep_dropoff_datetime": datetime.datetime(2009, 1, 1, 16, 10, 18),
    },
]
```
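Here, `ray_dataset` is the result of converting a table scan into a Ray Dataset. A minimal sketch of how it is typically obtained, assuming a configured catalog and the taxi table used throughout these docs (catalog and table names are illustrative):

```python
from pyiceberg.catalog import load_catalog

# Illustrative catalog and table names
catalog = load_catalog("default")
table = catalog.load_table("nyc.taxis")

# Convert the scan to a Ray Dataset; the filter and column selection
# are applied by the Iceberg scan before data is handed to Ray
ray_dataset = table.scan(
    row_filter="trip_distance >= 10.0",
    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
).to_ray()
```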

### Daft

PyIceberg interfaces closely with Daft Dataframes (see also: [Daft integration with Iceberg](https://docs.daft.ai/en/stable/io/iceberg/)), which provide a fully lazy, optimized query engine on top of PyIceberg tables.

<!-- prettier-ignore-start -->

!!! note "Requirements"
    This requires [Daft to be installed](index.md).

<!-- prettier-ignore-end -->

A table can be read easily into a Daft Dataframe:

```python
df = table.to_daft()  # equivalent to `daft.read_iceberg(table)`
df = df.where(df["trip_distance"] >= 10.0)
df = df.select("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime")
```

This returns a Daft Dataframe which is lazily materialized. Printing `df` will display the schema:

```python
╭──────────┬───────────────────────────────┬───────────────────────────────╮
│ VendorID ┆ tpep_pickup_datetime          ┆ tpep_dropoff_datetime         │
│ ---      ┆ ---                           ┆ ---                           │
│ Int64    ┆ Timestamp(Microseconds, None) ┆ Timestamp(Microseconds, None) │
╰──────────┴───────────────────────────────┴───────────────────────────────╯

(No data to display: Dataframe not materialized)
```

We can execute the Dataframe to preview the first few rows of the query with `df.show()`.

This is correctly optimized to take advantage of Iceberg features such as hidden partitioning and file-level statistics for efficient reads.

```python
df.show(2)
```

```python
╭──────────┬───────────────────────────────┬───────────────────────────────╮
│ VendorID ┆ tpep_pickup_datetime          ┆ tpep_dropoff_datetime         │
│ ---      ┆ ---                           ┆ ---                           │
│ Int64    ┆ Timestamp(Microseconds, None) ┆ Timestamp(Microseconds, None) │
╞══════════╪═══════════════════════════════╪═══════════════════════════════╡
│ 2        ┆ 2008-12-31T23:23:50.000000    ┆ 2009-01-01T00:34:31.000000    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2        ┆ 2008-12-31T23:05:03.000000    ┆ 2009-01-01T16:10:18.000000    │
╰──────────┴───────────────────────────────┴───────────────────────────────╯

(Showing first 2 rows)
```

### Polars

PyIceberg interfaces closely with Polars DataFrames and LazyFrames, which provide a fully lazy, optimized query engine on top of PyIceberg tables.

<!-- prettier-ignore-start -->

!!! note "Requirements"
    This requires [`polars` to be installed](index.md).

    ```sh
    pip install 'pyiceberg[polars]'
    ```

<!-- prettier-ignore-end -->

PyIceberg data can be analyzed and accessed through Polars using either a DataFrame or a LazyFrame.
If your code utilizes the Apache Iceberg data scanning and retrieval API and then analyzes the resulting DataFrame in Polars, use the `table.scan().to_polars()` API.
If the intent is to utilize Polars' high-performance filtering and retrieval functionalities, use the LazyFrame exported from the Iceberg table with the `table.to_polars()` API.

```python
# Get LazyFrame
iceberg_table.to_polars()

# Get DataFrame
iceberg_table.scan().to_polars()
```
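To make the LazyFrame route concrete, here is a minimal sketch, assuming `iceberg_table` refers to the taxi table used on this page; predicates applied before `collect()` can be pushed down into the Iceberg scan:

```python
import polars as pl

# Build a lazy query plan; nothing is read yet
lf = (
    iceberg_table.to_polars()
    .filter(pl.col("trip_distance") >= 10.0)
    .select("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime")
)

# collect() executes the plan and materializes a DataFrame
df = lf.collect()
```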

#### Working with Polars DataFrame

PyIceberg makes it easy to filter out data from a huge table and pull it into a Polars DataFrame locally. Only the Parquet files relevant to the query are fetched, and the filter is applied on read, which reduces IO and therefore improves performance and lowers cost.
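A hedged sketch of that pattern, reusing the taxi schema from this page (the filter string and field names are illustrative):

```python
# Only Parquet files whose statistics can match the predicate are fetched;
# the filter is applied while reading
df = iceberg_table.scan(
    row_filter="trip_distance >= 10.0",
    selected_fields=("VendorID", "tpep_pickup_datetime"),
).to_polars()
```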

```python
# Expire old snapshots, but always keep last 10 and at least 5 total

mkdocs/docs/cli.md

Lines changed: 16 additions & 0 deletions

@@ -219,3 +219,19 @@ Or output in JSON for automation:
}
}
```

You can also add, update, or remove properties on tables or namespaces:

```sh
➜ pyiceberg properties set table nyc.taxis write.metadata.delete-after-commit.enabled true
Set write.metadata.delete-after-commit.enabled=true on nyc.taxis

➜ pyiceberg properties get table nyc.taxis
write.metadata.delete-after-commit.enabled  true

➜ pyiceberg properties remove table nyc.taxis write.metadata.delete-after-commit.enabled
Property write.metadata.delete-after-commit.enabled removed from nyc.taxis

➜ pyiceberg properties get table nyc.taxis write.metadata.delete-after-commit.enabled
Could not find property write.metadata.delete-after-commit.enabled on nyc.taxis
```
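The same property can also be managed programmatically. A minimal sketch using PyIceberg's table transaction API (catalog and table names are illustrative):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # illustrative catalog name
table = catalog.load_table("nyc.taxis")

# Set the property, mirroring the CLI session above
with table.transaction() as tx:
    tx.set_properties({"write.metadata.delete-after-commit.enabled": "true"})

# Remove it again
with table.transaction() as tx:
    tx.remove_properties("write.metadata.delete-after-commit.enabled")
```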

mkdocs/docs/configuration.md

Lines changed: 90 additions & 18 deletions

@@ -339,40 +339,111 @@ catalog:

 | Key | Example | Description |
 | --- | ------- | ----------- |
| uri | <https://rest-catalog/ws> | URI identifying the REST Server |
343-
| ugi | t-1234:secret | Hadoop UGI for Hive client. |
344-
| credential | t-1234:secret | Credential to use for OAuth2 credential flow when initializing the catalog |
345-
| token | FEW23.DFSDF.FSDF | Bearer token value to use for `Authorization` header |
342+
| uri | <https://rest-catalog/ws> | URI identifying the REST Server |
343+
| warehouse | myWarehouse | Warehouse location or identifier to request from the catalog service. May be used to determine server-side overrides, such as the warehouse location. |
344+
| snapshot-loading-mode | refs | The snapshots to return in the body of the metadata. Setting the value to `all` would return the full set of snapshots currently valid for the table. Setting the value to `refs` would load all snapshots referenced by branches or tags. |
345+
| `header.X-Iceberg-Access-Delegation` | `vended-credentials` | Signal to the server that the client supports delegated access via a comma-separated list of access mechanisms. The server may choose to supply access via any or none of the requested mechanisms. When using `vended-credentials`, the server provides temporary credentials to the client. When using `remote-signing`, the server signs requests on behalf of the client. (default: `vended-credentials`) |
346+
347+
#### Headers in REST Catalog
348+
349+
To configure custom headers in REST Catalog, include them in the catalog properties with `header.<Header-Name>`. This
350+
ensures that all HTTP requests to the REST service include the specified headers.
351+
352+
```yaml
353+
catalog:
354+
default:
355+
uri: http://rest-catalog/ws/
356+
credential: t-1234:secret
357+
header.content-type: application/vnd.api+json
358+
```
359+
360+
#### Authentication Options
361+
362+
##### OAuth2
363+
364+
| Key | Example | Description |
365+
| ------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
366+
| oauth2-server-uri | <https://auth-service/cc> | Authentication URL to use for client credentials authentication (default: uri + 'v1/oauth/tokens') |
367+
| token | FEW23.DFSDF.FSDF | Bearer token value to use for `Authorization` header |
368+
| credential | client_id:client_secret | Credential to use for OAuth2 credential flow when initializing the catalog |
346369
| scope | openid offline corpds:ds:profile | Desired scope of the requested security token (default : catalog) |
347370
| resource | rest_catalog.iceberg.com | URI for the target resource or service |
348371
| audience | rest_catalog | Logical name of target resource or service |
372+
373+
##### SigV4
374+
375+
| Key | Example | Description |
376+
| ------------------- | -------------------------------- | -------------------------------------------------------------------------------------------------- |
349377
| rest.sigv4-enabled | true | Sign requests to the REST Server using AWS SigV4 protocol |
350378
| rest.signing-region | us-east-1 | The region to use when SigV4 signing a request |
351379
| rest.signing-name | execute-api | The service signing name to use when SigV4 signing a request |
352-
| oauth2-server-uri | <https://auth-service/cc> | Authentication URL to use for client credentials authentication (default: uri + 'v1/oauth/tokens') |
353-
| snapshot-loading-mode | refs | The snapshots to return in the body of the metadata. Setting the value to `all` would return the full set of snapshots currently valid for the table. Setting the value to `refs` would load all snapshots referenced by branches or tags. |
354-
| warehouse | myWarehouse | Warehouse location or identifier to request from the catalog service. May be used to determine server-side overrides, such as the warehouse location. |
355380

356381
<!-- markdown-link-check-enable-->
357382

358-
#### Headers in RESTCatalog
383+
#### Common Integrations & Examples
359384

360-
To configure custom headers in RESTCatalog, include them in the catalog properties with the prefix `header.`. This
361-
ensures that all HTTP requests to the REST service include the specified headers.
385+
##### AWS Glue

 ```yaml
 catalog:
-  default:
-    uri: http://rest-catalog/ws/
-    credential: t-1234:secret
-    header.content-type: application/vnd.api+json
+  s3_tables_catalog:
+    type: rest
+    uri: https://glue.<region>.amazonaws.com/iceberg
+    warehouse: <account-id>:s3tablescatalog/<table-bucket-name>
+    rest.sigv4-enabled: true
+    rest.signing-name: glue
+    rest.signing-region: <region>
+```
+
+##### Unity Catalog
+
+```yaml
+catalog:
+  unity_catalog:
+    type: rest
+    uri: https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest
+    warehouse: <uc-catalog-name>
+    token: <databricks-pat-token>
+```
+
+##### R2 Data Catalog
+
+```yaml
+catalog:
+  r2_catalog:
+    type: rest
+    uri: <r2-catalog-uri>
+    warehouse: <r2-warehouse-name>
+    token: <r2-token>
 ```

-Specific headers defined by the RESTCatalog spec include:
+##### Lakekeeper
+
+```yaml
+catalog:
+  lakekeeper_catalog:
+    type: rest
+    uri: <lakekeeper-catalog-uri>
+    warehouse: <lakekeeper-warehouse-name>
+    credential: <client-id>:<client-secret>
+    oauth2-server-uri: http://localhost:30080/realms/<keycloak-realm-name>/protocol/openid-connect/token
+    scope: lakekeeper
+```

-| Key | Options | Default | Description |
-| --- | ------- | ------- | ----------- |
-| `header.X-Iceberg-Access-Delegation` | `{vended-credentials,remote-signing}` | `vended-credentials` | Signal to the server that the client supports delegated access via a comma-separated list of access mechanisms. The server may choose to supply access via any or none of the requested mechanisms |
+##### Apache Polaris
+
+```yaml
+catalog:
+  polaris_catalog:
+    type: rest
+    uri: https://<account>.snowflakecomputing.com/polaris/api/catalog
+    warehouse: <polaris-catalog-name>
+    credential: <client-id>:<client-secret>
+    header.X-Iceberg-Access-Delegation: vended-credentials
+    scope: PRINCIPAL_ROLE:ALL
+    token-refresh-enabled: true
+    py-io-impl: pyiceberg.io.fsspec.FsspecFileIO
+```

 ### SQL Catalog

@@ -444,6 +515,7 @@ catalog:
 | hive.hive2-compatible        | true          | Using Hive 2.x compatibility mode |
 | hive.kerberos-authentication | true          | Using authentication via Kerberos |
 | hive.kerberos-service-name   | hive          | Kerberos service name (default: hive) |
+| ugi                          | t-1234:secret | Hadoop UGI for Hive client. |

 When using Hive 2.x, make sure to set the compatibility flag:
