Commit 44faeb8

Liam Brannigan authored and committed
Add review changes
1 parent d7e13eb commit 44faeb8

1 file changed: +19 -13 lines changed

docs/usage/working-with-partitions.md

Lines changed: 19 additions & 13 deletions
@@ -9,7 +9,7 @@ Below, we demonstrate how to create, query, and update partitioned Delta tables,
 
 To create a partitioned Delta table, specify one or more partition columns when creating the table. Here we partition by the country column.
 ```python
-from deltalake import write_deltalake
+from deltalake import write_deltalake, DeltaTable
 import pandas as pd
 
 df = pd.DataFrame({
@@ -98,9 +98,10 @@ print(pdf)
 
 ### Overwriting a Partition
 
-You can overwrite a specific partition, leaving the other partitions intact. Pass in `mode="overwrite"` together with a predicate string.
+To overwrite a specific partition or partitions, set `mode="overwrite"` together with a predicate string that specifies
+which partitions are present in the new data. By setting the predicate, `deltalake` is able to skip the other partitions.
 
-In this example we overwrite the `DE` paritition with new data.
+In this example we overwrite the `DE` partition with new data.
 
 ```python
 df_overwrite = pd.DataFrame({
@@ -134,16 +135,17 @@ print(pdf)
 
 ## Updating Partitioned Tables with Merge
 
-You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. Simply provide a matching predicate that references partition columns if needed.
+You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. If only a subset of existing partitions needs to be read, provide a matching predicate that references the partition columns represented in the source data. The predicate then allows `deltalake` to skip reading the partitions not referenced by the predicate.
 
-You can match on both the partition column (country) and some other condition. This example shows a merge operation that checks both the partition column ("country") and a numeric column ("num") when merging:
+This example shows a merge operation that checks both the partition column (`"country"`) and another column (`"num"`) when merging:
 - The merge condition (predicate) matches target rows where both "country" and "num" align with the source.
-- When a match occurs, it updates the "letter" column; otherwise, it inserts the new row.
+- If a match is found between a source row and a target row, the `"letter"` column is updated with the source data.
+- Otherwise, if no match is found for a source row, the new row is inserted, creating a new partition if necessary.
 
 ```python
 dt = DeltaTable("tmp/partitioned-table")
 
-source_data = pd.DataFrame({"num": [1, 101], "letter": ["A", "B"], "country": ["US", "US"]})
+source_data = pd.DataFrame({"num": [1, 101], "letter": ["A", "B"], "country": ["US", "CH"]})
 
 (
     dt.merge(
@@ -166,7 +168,7 @@ print(pdf)
 
 ```plaintext
     num  letter  country
-0   101       B       US
+0   101       B       CH
 1     1       A       US
 2     2       b       US
 3   900       m       DE
@@ -192,10 +194,11 @@ print(pdf)
 
 ```plaintext
     num  letter  country
-0   900       m       DE
-1  1000       n       DE
-2    10       x       CA
-3     3       c       CA
+0   101       B       CH
+1   900       m       DE
+2  1000       n       DE
+3    10       x       CA
+4     3       c       CA
 ```
 This command logically deletes the data by creating a new transaction.
 
@@ -204,6 +207,9 @@ This command logically deletes the data by creating a new transaction.
 ### Optimize & Vacuum
 
 Partitioned tables can accumulate many small files if a partition is frequently appended to. You can compact these into larger files on a specific partition with [`optimize.compact`](../../delta_table/#deltalake.DeltaTable.optimize).
+
+If we want to target compaction at specific partitions, we can include partition filters.
+
 ```python
 dt.optimize.compact(partition_filters=[("country", "=", "CA")])
 ```
@@ -212,4 +218,4 @@ Then optionally [`vacuum`](../../delta_table/#deltalake.DeltaTable.vacuum) the t
 
 ### Handling High-Cardinality Columns
 
-Partitioning can be very powerful, but be mindful of using high-cardinality columns (columns with too many unique values). This can create an excessive number of directories and can hurt performance. For example, partitioning by date is typically better than partitioning by user_id if user_id has millions of unique values.
+Partitioning can be useful for reducing the time it takes to update and query a table, but be mindful of creating partitions against high-cardinality columns (columns with many unique values). Doing so can create an excessive number of partition directories, which can hurt performance. For example, partitioning by date is typically better than partitioning by `user_id` if `user_id` has millions of unique values.
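
As an illustration of the high-cardinality advice in the last hunk, the following minimal sketch counts distinct values per candidate partition column, since each distinct value would become its own partition directory. It is not part of the commit or of the `deltalake` API: the helper `distinct_counts` and the sample rows are hypothetical and use plain Python only.

```python
def distinct_counts(rows, columns):
    """Count distinct values per column.

    Each distinct value of a partition column maps to one partition
    directory, so a lower count means fewer directories.
    """
    return {col: len({row[col] for row in rows}) for col in columns}


# Hypothetical sample data: 1,000 events, one per user, spread over 7 days.
rows = [
    {"user_id": i, "date": f"2024-01-{1 + i % 7:02d}"}
    for i in range(1000)
]

counts = distinct_counts(rows, ["user_id", "date"])

# Pick the candidate that would create the fewest partition directories.
best = min(counts, key=counts.get)

print(counts)  # {'user_id': 1000, 'date': 7}
print(best)    # date
```

On this sample, partitioning by `date` would create 7 directories and partitioning by `user_id` would create 1,000, which is the contrast the paragraph describes. A check like this is only a rough guide; query patterns also matter when choosing a partition column.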
