Commit 362c59a

feat: minio and pyspark docs added
1 parent 31587f9 commit 362c59a

5 files changed: +125 −0 lines changed

docs/data_engineering/data_lakehouse/apache_iceberg.md

Lines changed: 4 additions & 0 deletions
@@ -103,3 +103,7 @@ async fn main() {
    println!("{:?}", table_created.metadata());
}
```

### Insert data

The iceberg-rust package currently lacks write support ([source](https://github.com/apache/iceberg-rust/issues/700)).
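
As a stopgap, writes can go through another engine. Below is a minimal sketch using PyIceberg instead (assuming `pyiceberg[sql-postgres,pyarrow,s3fs]` is installed, that PyIceberg's SQL catalog can read the same JDBC catalog tables, and connection settings matching the Spark setup in apache_spark.md; the `db.table` identifier is hypothetical):

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# SQL catalog backed by the same Postgres instance that Spark's JDBC catalog uses
catalog = load_catalog(
    'my_catalog',
    **{
        'type': 'sql',
        'uri': 'postgresql+psycopg2://postgres:postgres@localhost:5500/postgres',
        'warehouse': 's3://data-lakehouse',
        's3.endpoint': 'http://localhost:5561',
        's3.access-key-id': 'admin',
        's3.secret-access-key': 'password',
    },
)

# append a small Arrow table to an existing Iceberg table
table = catalog.load_table('db.table')
table.append(pa.table({'name': ['Alex']}))
```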

docs/data_engineering/data_lakehouse/apache_spark.md

Lines changed: 73 additions & 0 deletions
@@ -37,3 +37,76 @@ services:
    depends_on:
      - spark_master
```

## Python library

-> [MinIO](../../dev_ops/services/minio.md) as a local S3 service

### Apache Iceberg integration

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master('spark://localhost:7077')
    .config(
        'spark.jars.packages',
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri', 'jdbc:postgresql://localhost:5500/postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.region', 'us-east-1')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://YOUR_IP_ADDRESS:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    # path-style addressing is typically required for MinIO endpoints
    .config('spark.sql.catalog.my_catalog.s3.path-style-access', 'true')
    .getOrCreate()
)

spark.sql('CREATE TABLE my_catalog.table (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
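
A quick read-back to confirm the catalog and warehouse wiring, plus a peek at Iceberg's snapshot log (the `snapshots` metadata table is standard Iceberg):

```python
# read the inserted rows back through the catalog
spark.sql('SELECT * FROM my_catalog.table ORDER BY name').show()

# Iceberg exposes metadata tables next to the data, e.g. the snapshot log
spark.sql('SELECT committed_at, operation FROM my_catalog.table.snapshots').show()
```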

### Apache Iceberg + Sedona

```python
from sedona.spark import SedonaContext

spark = (
    SedonaContext.builder()
    .master('spark://localhost:7077')
    .config(
        'spark.jars.packages',
        # sedona
        'org.apache.sedona:sedona-spark-3.5_2.12:1.7.0,'
        'org.datasyslab:geotools-wrapper:1.7.0-28.5,'
        # iceberg
        'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,'
        'org.apache.iceberg:iceberg-aws-bundle:1.7.1,'
        'org.postgresql:postgresql:42.7.4',
    )
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.my_catalog', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.my_catalog.type', 'jdbc')
    .config('spark.sql.catalog.my_catalog.uri', 'jdbc:postgresql://localhost:5500/postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.user', 'postgres')
    .config('spark.sql.catalog.my_catalog.jdbc.password', 'postgres')
    .config('spark.sql.catalog.my_catalog.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    .config('spark.sql.catalog.my_catalog.warehouse', 's3://data-lakehouse')
    .config('spark.sql.catalog.my_catalog.s3.region', 'us-east-1')
    .config('spark.sql.catalog.my_catalog.s3.endpoint', 'http://YOUR_IP_ADDRESS:5561')
    .config('spark.sql.catalog.my_catalog.s3.access-key-id', 'admin')
    .config('spark.sql.catalog.my_catalog.s3.secret-access-key', 'password')
    .config('spark.sql.catalog.my_catalog.s3.path-style-access', 'true')
    .getOrCreate()
)

# register Sedona's SQL functions (ST_*) on the session
sedona = SedonaContext.create(spark)

spark.sql('CREATE TABLE my_catalog.table8 (name string) USING iceberg;')
spark.sql("INSERT INTO my_catalog.table8 VALUES ('Alex'), ('Dipankar'), ('Jason')")
```
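
With the functions registered, a minimal smoke test (`ST_Point` and `ST_AsText` are standard Sedona SQL functions):

```python
# should return one row with the WKT string POINT (1 2)
sedona.sql('SELECT ST_AsText(ST_Point(1.0, 2.0)) AS point_wkt').show()
```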

docs/dev_ops/.pages

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
title: ⚙️ Dev Ops

docs/dev_ops/services/.pages

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
title: 📦 Services

docs/dev_ops/services/minio.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# 🪣 MinIO

## Docker compose

`docker_compose.yaml`:

```yaml
services:
  storage_s3:
    restart: always
    image: quay.io/minio/minio:RELEASE.2024-10-29T16-01-48Z
    ports:
      - 5560:5560
      - 5561:5561
    hostname: storage-s3
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: password
    command: server /data --console-address ":5560" --address=":5561"
    healthcheck:
      # the liveness endpoint is served on the S3 API port (--address), not the console port
      test: ["CMD", "curl", "-f", "http://localhost:5561/minio/health/live"]
      interval: 5s
      timeout: 5s
      retries: 5

  storage_s3_initial_setup:
    image: minio/mc:RELEASE.2024-10-29T15-34-59Z
    depends_on:
      storage_s3:
        condition: service_healthy
    volumes:
      - ./docker_entrypoint.sh:/docker_entrypoint.sh:z
    entrypoint:
      - /docker_entrypoint.sh
```

`docker_entrypoint.sh`:

```sh
#!/bin/bash

# Set up the alias for MinIO
mc alias set minio http://storage-s3:5561 admin password

# Create buckets
mc mb minio/data-lakehouse
```
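
Once the stack is up, a minimal sketch (assuming `boto3` is installed) to verify the bucket from Python, reusing the credentials and API port from the compose file above:

```python
import boto3

# MinIO speaks the S3 API, so a plain S3 client pointed at the API port works
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:5561',
    aws_access_key_id='admin',
    aws_secret_access_key='password',
)

# should list the data-lakehouse bucket created by docker_entrypoint.sh
print([bucket['Name'] for bucket in s3.list_buckets()['Buckets']])
```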
