Commit 2ba2f6b

Fixing checkpoints docs example (#1637)

* fixing checkpoints docs example
* fixed docs comments

1 parent 507d11c commit 2ba2f6b

File tree

1 file changed: +15 -15 lines changed
docs/guide/checkpoints.md

Lines changed: 15 additions & 15 deletions
````diff
@@ -48,34 +48,34 @@ Consider this script that processes data in multiple stages:

 ```python
 import datachain as dc

-# Stage 1: Load and filter data
+# Stage 1: List and filter files
 filtered = (
-    dc.read_csv("s3://mybucket/data.csv")
-    .filter(dc.C("score") > 0.5)
-    .save("filtered_data")
+    dc.read_storage("gs://datachain-demo/dogs-and-cats/", anon=True)
+    .filter(dc.C("file.path").glob("*.jpg"))
+    .save("filtered_files")
 )

 # Stage 2: Transform data
 transformed = (
     filtered
-    .map(value=lambda x: x * 2, output=float)
+    .map(size_kb=lambda file: file.size / 1024, output=float)
     .save("transformed_data")
 )

 # Stage 3: Aggregate results
 result = (
     transformed
     .agg(
-        total=lambda values: sum(values),
-        partition_by="category",
+        total=lambda size_kb: [sum(size_kb)],
+        output=float,
     )
     .save("final_results")
 )
 ```

-**First run:** The script executes all three stages and creates three datasets: `filtered_data`, `transformed_data`, and `final_results`. If the script fails during Stage 3, only `filtered_data` and `transformed_data` are saved.
+**First run:** The script executes all three stages and creates three datasets: `filtered_files`, `transformed_data`, and `final_results`. If the script fails during Stage 3, only `filtered_files` and `transformed_data` are saved.

-**Second run:** DataChain detects that `filtered_data` and `transformed_data` were already created in the previous run with matching hashes. It skips recreating them and proceeds directly to Stage 3, creating only `final_results`.
+**Second run:** DataChain detects that `filtered_files` and `transformed_data` were already created in the previous run with matching hashes. It skips recreating them and proceeds directly to Stage 3, creating only `final_results`.

 ## When Checkpoints Are Used
````
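The skip-or-recompute decision described in the first-run/second-run text can be sketched in a few lines; `run_stage` and the in-memory `checkpoints` dict are illustrative stand-ins, not DataChain's actual API or storage:

```python
# Hypothetical sketch: before recomputing a stage, compare its current
# hash against the hash recorded for that dataset on the previous run.
def run_stage(name, current_hash, checkpoints, compute):
    if checkpoints.get(name) == current_hash:
        return "reused"              # checkpoint hit: skip recomputation
    compute()                        # checkpoint miss: run the stage
    checkpoints[name] = current_hash
    return "computed"

# After a first run that failed during Stage 3, only the first two
# checkpoints exist:
checkpoints = {"filtered_files": "H1", "transformed_data": "H2"}

print(run_stage("filtered_files", "H1", checkpoints, lambda: None))    # reused
print(run_stage("transformed_data", "H2", checkpoints, lambda: None))  # reused
print(run_stage("final_results", "H3", checkpoints, lambda: None))     # computed
```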

````diff
@@ -151,23 +151,23 @@ Changes that invalidate checkpoints include:

 ```python
 # First run - creates three checkpoints
-dc.read_csv("data.csv").save("stage1")  # Hash = H1
+dc.read_storage("gs://datachain-demo/dogs-and-cats/", anon=True).save("stage1")  # Hash = H1

-dc.read_dataset("stage1").filter(dc.C("x") > 5).save("stage2")  # Hash = H2 = hash(H1 + pipeline_hash)
+dc.read_dataset("stage1").filter(dc.C("file.path").glob("*.jpg")).save("stage2")  # Hash = H2 = hash(H1 + pipeline_hash)

-dc.read_dataset("stage2").select("name", "value").save("stage3")  # Hash = H3 = hash(H2 + pipeline_hash)
+dc.read_dataset("stage2").select("file").save("stage3")  # Hash = H3 = hash(H2 + pipeline_hash)
 ```

 **Second run (no changes):**
 - All three hashes match → all three datasets are reused → no computation

 **Second run (modified filter):**
 ```python
-dc.read_csv("data.csv").save("stage1")  # Hash = H1 matches ✓ → reused
+dc.read_storage("gs://datachain-demo/dogs-and-cats/", anon=True).save("stage1")  # Hash = H1 matches ✓ → reused

-dc.read_dataset("stage1").filter(dc.C("x") > 10).save("stage2")  # Hash ≠ H2 ✗ → recomputed
+dc.read_dataset("stage1").filter(dc.C("file.path").glob("*.png")).save("stage2")  # Hash ≠ H2 ✗ → recomputed

-dc.read_dataset("stage2").select("name", "value").save("stage3")  # Hash ≠ H3 ✗ → recomputed
+dc.read_dataset("stage2").select("file").save("stage3")  # Hash ≠ H3 ✗ → recomputed
 ```
````

Because the filter changed, `stage2` has a different hash and must be recomputed. Since `stage3` depends on `stage2`, its hash also changes (because it includes H2 in the calculation), so it must be recomputed as well.
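This invalidation cascade can be illustrated with plain `hashlib`; `chain_hash` and the stage strings below are hypothetical stand-ins, not DataChain's actual hashing scheme:

```python
import hashlib

def chain_hash(parent_hash: str, pipeline_code: str) -> str:
    # Illustrative only: combine the upstream dataset's hash with a hash
    # of this stage's pipeline definition, as in H2 = hash(H1 + pipeline_hash).
    pipeline_hash = hashlib.sha256(pipeline_code.encode()).hexdigest()
    return hashlib.sha256((parent_hash + pipeline_hash).encode()).hexdigest()

# First run
h1 = chain_hash("", 'read_storage("gs://datachain-demo/dogs-and-cats/")')
h2 = chain_hash(h1, 'filter(file.path glob "*.jpg")')
h3 = chain_hash(h2, 'select("file")')

# Second run with the filter changed from *.jpg to *.png: h1 still matches,
# but h2 differs, and h3 differs too because it includes h2 as input.
h2_new = chain_hash(h1, 'filter(file.path glob "*.png")')
h3_new = chain_hash(h2_new, 'select("file")')

assert h2_new != h2 and h3_new != h3   # stage2 and stage3 both invalidated
```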
0 commit comments