Docs: document full artifact contract per pipeline stage, aligned with smoke-test expectations #92

@adidev001

Artifact layout in pipeline docs doesn't fully match what the smoke test expects

While setting up the repo locally and reading the pipeline docs alongside
the smoke test, I noticed that some artifacts test_pipeline_smoke.py checks
for aren't clearly documented in docs/pipeline.md.

The docs do a good job covering the main outputs of each stage, but a few
runtime and debug artifacts that show up during actual runs are easy to miss
as a new contributor:

  • clean_markdown/ — produced by Corpus.clean() separately from markdown/
  • .processing_state.pkl — the resume/state file persisted across runs
  • problematic_files/ and timeout_files/ — triage outputs from cleaning

I ran into this because the contribution notes say that when outputs move or
change, docs/pipeline.md and tests/test_pipeline_smoke.py should be updated
together. That made me want to check whether the two were actually in sync,
and they're not quite there yet.
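
For reference, the kind of sync check I did by hand could be sketched roughly as below. This is only an illustration: `undocumented_artifacts` is a hypothetical helper, not something that exists in the repo, and it assumes the docs mention each artifact by its literal name.

```python
import re


def undocumented_artifacts(doc_text, artifact_names):
    """Return the artifact names that never appear in the docs text.

    Hypothetical helper for illustration only. The lookbehind keeps
    'markdown/' from matching inside 'clean_markdown/'.
    """
    missing = []
    for name in artifact_names:
        pattern = r"(?<![\w./-])" + re.escape(name)
        if re.search(pattern, doc_text) is None:
            missing.append(name)
    return missing


# Made-up docs excerpt, not the real contents of docs/pipeline.md.
sample_docs = "The extract stage writes its output to `markdown/`."
names = ["markdown/", "clean_markdown/", ".processing_state.pkl"]
print(undocumented_artifacts(sample_docs, names))
```

In practice the names would come from whatever the smoke test asserts and the text from docs/pipeline.md; both are stubbed here.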

What I'd propose

I'd like to update docs/pipeline.md to reflect the full per-stage artifact
layout, matching what the smoke test asserts. I think it would also help to
distinguish between:

  • Canonical outputs — the intended deliverables of each stage
  • Runtime/resume artifacts — things like .processing_state.pkl that
    support crash recovery
  • Debug/triage artifacts — folders like problematic_files/ that help
    diagnose failures
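
To make the taxonomy concrete, here is a minimal sketch of how the three categories might be written down and checked. The artifact names are the ones mentioned in this issue; the grouping of `markdown/` and `clean_markdown/` as canonical, the flat layout under a single output directory, and the names `ArtifactKind`, `ARTIFACTS`, and `missing_artifacts` are all my assumptions, not existing code.

```python
from enum import Enum
from pathlib import Path


class ArtifactKind(Enum):
    CANONICAL = "canonical output"
    RUNTIME = "runtime/resume artifact"
    DEBUG = "debug/triage artifact"


# Names come from this issue; the category assigned to each one is an
# assumption. A trailing slash marks a directory.
ARTIFACTS = {
    "markdown/": ArtifactKind.CANONICAL,
    "clean_markdown/": ArtifactKind.CANONICAL,
    ".processing_state.pkl": ArtifactKind.RUNTIME,
    "problematic_files/": ArtifactKind.DEBUG,
    "timeout_files/": ArtifactKind.DEBUG,
}


def missing_artifacts(output_dir, kinds=(ArtifactKind.CANONICAL,)):
    """List artifacts of the given kinds that are absent from output_dir.

    Assumes every artifact sits directly under output_dir, which may not
    match the real pipeline layout.
    """
    root = Path(output_dir)
    missing = []
    for name, kind in ARTIFACTS.items():
        if kind not in kinds:
            continue
        path = root / name.rstrip("/")
        present = path.is_dir() if name.endswith("/") else path.is_file()
        if not present:
            missing.append(name)
    return missing
```

A check like `missing_artifacts(out_dir)` would cover only canonical outputs by default, while passing `kinds=tuple(ArtifactKind)` would include the resume and triage artifacts too. Whether the smoke test should require the debug folders on a clean run is something I'd want to confirm before drafting the PR.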

This way, the next contributor who modifies a stage output has a clear
reference for what should exist, what the smoke test will check, and what
needs to be updated together.

Happy to hear if this taxonomy makes sense or if you'd prefer a different
way to organise it before I draft the PR.
