docs/usage/working-with-partitions.md: 19 additions & 13 deletions
@@ -9,7 +9,7 @@ Below, we demonstrate how to create, query, and update partitioned Delta tables,
To create a partitioned Delta table, specify one or more partition columns when creating the table. Here we partition by the country column.
```python
-from deltalake import write_deltalake
+from deltalake import write_deltalake,DeltaTable
import pandas as pd
df = pd.DataFrame({
@@ -98,9 +98,10 @@ print(pdf)
### Overwriting a Partition
-You can overwrite a specific partition, leaving the other partitions intact. Pass in `mode="overwrite"` together with a predicate string.
+To overwrite one or more specific partitions, set `mode="overwrite"` together with a predicate string that specifies
+which partitions are present in the new data. Setting the predicate lets `deltalake` skip the other partitions.
-In this example we overwrite the `DE`paritition with new data.
+In this example we overwrite the `DE` partition with new data.
```python
df_overwrite = pd.DataFrame({
@@ -134,16 +135,17 @@ print(pdf)
## Updating Partitioned Tables with Merge
-You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. Simply provide a matching predicate that references partition columns if needed.
+You can perform merge operations on partitioned tables in the same way you do on non-partitioned ones. If only a subset of the existing partitions needs to be read, provide a matching predicate that references the partition columns represented in the source data. The predicate then allows `deltalake` to skip reading the partitions it does not reference.
-You can match on both the partition column (country) and some other condition. This example shows a merge operation that checks both the partition column ("country") and a numeric column ("num") when merging:
+This example shows a merge operation that checks both the partition column (`"country"`) and another column (`"num"`) when merging:
- The merge condition (predicate) matches target rows where both "country" and "num" align with the source.
-- When a match occurs, it updates the "letter" column; otherwise, it inserts the new row.
+- If a match is found between a source row and a target row, the `"letter"` column is updated with the source data.
+- Otherwise, if no match is found for a source row, the new row is inserted, creating a new partition if necessary.
This command logically deletes the data by creating a new transaction.
@@ -204,6 +207,9 @@ This command logically deletes the data by creating a new transaction.
### Optimize & Vacuum
Partitioned tables can accumulate many small files if a partition is frequently appended to. You can compact these into larger files on a specific partition with [`optimize.compact`](../../delta_table/#deltalake.DeltaTable.optimize).
+
+If we want to target compaction at specific partitions, we can include partition filters.
@@ -212,4 +218,4 @@ Then optionally [`vacuum`](../../delta_table/#deltalake.DeltaTable.vacuum) the t
### Handling High-Cardinality Columns
-Partitioning can be very powerful, but be mindful of using high-cardinality columns (columns with too many unique values). This can create an excessive number of directories and can hurt performance. For example, partitioning by date is typically better than partitioning by user_id if user_id has millions of unique values.
+Partitioning can reduce the time it takes to update and query a table, but be mindful of partitioning on high-cardinality columns (columns with many unique values). Doing so can create an excessive number of partition directories, which can hurt performance. For example, partitioning by date is typically better than partitioning by user_id if user_id has millions of unique values.
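
As a rough illustration of the cardinality point, the sketch below counts how many partition directories a candidate column would produce. It is plain Python and does not call `deltalake`; the helper `count_partition_dirs` and the sample `rows` are hypothetical names made up for this example, and it assumes one directory per distinct combination of partition values.

```python
from datetime import date


def count_partition_dirs(rows, partition_cols):
    """Count distinct partition-value combinations in `rows`.

    Assumption: the table layout creates one directory per distinct
    combination of values in `partition_cols`.
    """
    return len({tuple(row[col] for col in partition_cols) for row in rows})


# Hypothetical sample data: 10,000 users with events spread over one week.
rows = [
    {"user_id": i, "event_date": date(2024, 1, 1 + i % 7)}
    for i in range(10_000)
]

by_date = count_partition_dirs(rows, ["event_date"])  # one directory per day
by_user = count_partition_dirs(rows, ["user_id"])     # one directory per user

print(f"partition by event_date: {by_date} directories")
print(f"partition by user_id:    {by_user} directories")
```

Running a check like this on a sample of the data before choosing partition columns gives a quick sense of whether a column is too fine-grained to partition on.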