Artifact layout in pipeline docs doesn't fully match what the smoke test expects
While setting up the repo locally and reading through the pipeline docs alongside
the smoke test, I noticed some artifacts that test_pipeline_smoke.py checks for
aren't really surfaced in docs/pipeline.md.
The docs do a good job covering the main outputs of each stage, but a few
runtime and debug artifacts that show up during actual runs are easy to miss
as a new contributor:
clean_markdown/— produced byCorpus.clean()separately frommarkdown/.processing_state.pkl— the resume/state file persisted across runsproblematic_files/andtimeout_files/— triage outputs from cleaning
I ran into this because the contribution notes say that when outputs move or
change, docs/pipeline.md and tests/test_pipeline_smoke.py should be updated
together. That made me want to check whether the two were actually in sync,
and they're not quite there yet.
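To make the sync check concrete, here is a rough sketch of how the mismatch could be detected automatically. The helper name `undocumented_artifacts` and the `EXPECTED_ARTIFACTS` list are hypothetical; the names come from this issue, not from the actual assertions in tests/test_pipeline_smoke.py, which may differ.

```python
# Hypothetical list: artifact names mentioned in this issue, not read
# from the real smoke test.
EXPECTED_ARTIFACTS = [
    "clean_markdown/",
    ".processing_state.pkl",
    "problematic_files/",
    "timeout_files/",
]


def undocumented_artifacts(doc_text: str, artifacts: list[str]) -> list[str]:
    """Return the artifact names that never appear in the docs text.

    This is a plain substring check, so it is deliberately naive: a name
    that is a substring of another (e.g. "markdown/" inside
    "clean_markdown/") would count as mentioned.
    """
    return [name for name in artifacts if name not in doc_text]
```

Running something like `undocumented_artifacts(Path("docs/pipeline.md").read_text(), EXPECTED_ARTIFACTS)` (with `pathlib.Path` imported) would list whatever the docs still omit.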
What I'd propose
I'd like to update docs/pipeline.md to reflect the full per-stage artifact
layout, matching what the smoke test asserts. I think it would also help to
distinguish between:
- Canonical outputs: the intended deliverables of each stage
- Runtime/resume artifacts: things like `.processing_state.pkl` that support crash recovery
- Debug/triage artifacts: folders like `problematic_files/` that help diagnose failures
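As a sketch of how this taxonomy could be written down as data, something like the following might work. `ArtifactKind`, `ARTIFACTS`, and `missing_by_kind` are hypothetical names, and classifying `markdown/` and `clean_markdown/` as canonical is my assumption rather than something the repo states.

```python
from enum import Enum
from pathlib import Path


class ArtifactKind(Enum):
    CANONICAL = "canonical output"
    RUNTIME = "runtime/resume artifact"
    DEBUG = "debug/triage artifact"


# Proposed classification only; the repo does not define this mapping today.
ARTIFACTS = {
    "markdown": ArtifactKind.CANONICAL,        # assumed canonical
    "clean_markdown": ArtifactKind.CANONICAL,  # assumed canonical
    ".processing_state.pkl": ArtifactKind.RUNTIME,
    "problematic_files": ArtifactKind.DEBUG,
    "timeout_files": ArtifactKind.DEBUG,
}


def missing_by_kind(output_dir: Path, kind: ArtifactKind) -> list[str]:
    """List artifacts of the given kind that do not exist under output_dir."""
    return sorted(
        name
        for name, artifact_kind in ARTIFACTS.items()
        if artifact_kind is kind and not (output_dir / name).exists()
    )
```

A smoke test could then require that `missing_by_kind(out, ArtifactKind.CANONICAL)` is empty while treating runtime and debug artifacts as optional, and the docs could mirror the same three groups.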
This way, the next contributor who modifies a stage output has a clear
reference for what should exist, what the smoke test will check, and what
needs to be updated together.
Happy to hear if this taxonomy makes sense or if you'd prefer a different
way to organise it before I draft the PR.