Data Inspection Tool #2904

Open

Helw150 wants to merge 3 commits into main from will/inspect-debug-data

Conversation

Helw150 (Member) commented Feb 20, 2026

  • Add scripts/training/inspect_data.py — a CLI tool to inspect decoded training data at a given step on a Ray cluster, useful for debugging data pipelines, mixture weights, and masking
  # Dump a single step's batch as JSONL for agents
  uv run scripts/training/inspect_data.py experiments/references/canary_train.py --step 100 --cluster us-central2

  # Per-step summary stats across multiple steps for agents
  uv run scripts/training/inspect_data.py experiments/references/canary_train.py --steps 0,100,500 --cluster us-central2 --summary

  # Interactive TUI browser for people
  uv run scripts/training/inspect_data.py experiments/references/canary_train.py --step 100 --cluster us-central2 --tui


claude bot commented Feb 20, 2026

Claude finished @Helw150's task in 2m 1s


Review: Data Inspection Tool

  • Read changed file and gather context
  • Verify internal API usage against levanter/marin
  • Post review

This PR adds a CLI tool (scripts/debug/inspect_data.py) that decodes and dumps training batches at a given step, supporting JSONL output, summary stats, and an interactive TUI. The internal API usage against MixtureDataset, BatchSchedule, and TrainLmOnPodConfig is correct.

Specification check: This PR is 587 lines with no linked specification or design doc. Per docs/recipes/agent-coding.md, a specification is expected for PRs of this size.

Bugs

  • Broken Ray job submission path (scripts/debug/inspect_data.py:87): The entrypoint is hardcoded as scripts/training/inspect_data.py, but the file actually lives at scripts/debug/inspect_data.py. When --cluster is used (the primary use case), the submitted Ray job will fail because the entrypoint path doesn't exist. The docstring usage examples (lines 10-17) have the same stale path.

  • TUI Page Down scroll calculation is inconsistent (scripts/debug/inspect_data.py:410-412): Page Down estimates the total line count as len(text.split("\n")) * 2, while draw() computes the true wrapped line count via textwrap.wrap, so Page Down can overshoot or undershoot the actual content length. Page Down/Up should reuse the same wrapping logic as draw(), or cache the wrapped line count.

Guidelines

  • The code accesses several internal/private methods of MixtureDataset (_get_block, _get_stage_for_block). These are implementation details that could change without notice. Consider whether MixtureDataset should expose a small public API surface for inspection use cases, or at minimum add a comment acknowledging the coupling.
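For the second option, a minimal sketch of what such an acknowledgment could look like at the call site (the comment wording is only a suggestion; the surrounding names come from the loop quoted further down):

# NOTE: MixtureDataset has no public API for per-index inspection, so this tool
# relies on the private helpers _get_block / _get_stage_for_block. If those
# internals change, this script must be updated in lockstep.
block = dataset._get_block(block_id)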


config = _resolve_cluster_config(cluster)

parts = ["python", "scripts/training/inspect_data.py", experiment]

Bug: This path doesn't match the actual file location. The file is at scripts/debug/inspect_data.py, so the Ray job will fail to find the entrypoint.

Suggested change
parts = ["python", "scripts/training/inspect_data.py", experiment]
parts = ["python", "scripts/debug/inspect_data.py", experiment]

Comment on lines 410 to 412
text = examples[doc_idx]["text"]
total = len(text.split("\n")) * 2
scroll_offset = min(scroll_offset + body_h, max(0, total))

This rough estimate (newlines * 2) diverges from the actual wrapping logic in draw() which uses textwrap.wrap(line, w - 1). Page Down can overshoot or undershoot the real content. Consider computing the wrapped line count from the same textwrap logic used in draw, or caching it.
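A minimal sketch of that approach, reusing the names from the snippet above (body_h, scroll_offset, doc_idx, and the w - 1 wrap width); the helper is hypothetical and the exact clamp is an assumption:

import textwrap

def wrapped_line_count(text: str, width: int) -> int:
    # Mirror draw(): wrap each logical line; an empty line still occupies one row.
    return sum(max(1, len(textwrap.wrap(line, width))) for line in text.split("\n"))

# Page Down: advance by one screen, clamped so the last screenful stays visible.
total = wrapped_line_count(examples[doc_idx]["text"], w - 1)
scroll_offset = min(scroll_offset + body_h, max(0, total - body_h))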

@XenonMolecule left a comment


Overall this is great!!! And I LOVE the interactive viewer. I had a few errors when I ran the non-interactive mode on my data, so I just wanted to flag the changes I had to make to get it to work, but all around it was a super useful tool!


# When submitted as a Ray job, the script runs on the cluster without --cluster.
# Detect this via RAY_JOB_ID which Ray sets automatically for submitted jobs.
on_cluster_node = os.environ.get("RAY_JOB_ID") is not None


Is this always set when we run on the cluster? When I launched with

uv run scripts/debug/inspect_data.py experiments/exp_qwen3_0_6b_rephraser_sft.py \
    --step 736 --cluster us-central1 -o step_736.jsonl

this line was causing me trouble. I fixed it by moving to

on_cluster_node = os.environ.get("RAY_JOB_ID") is not None or os.environ.get("MARIN_PREFIX") is not None

for idx in indices:
    block_id = idx // dataset.block_size
    index_within_block = idx % dataset.block_size
    block = dataset._get_block(block_id)


Maybe this should be:

blocking_wait(dataset._get_block(block_id)) instead of dataset._get_block(block_id)
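With that change applied, the quoted loop would read roughly as follows (assuming blocking_wait is in scope and that _get_block returns an awaitable/future):

for idx in indices:
    block_id = idx // dataset.block_size
    index_within_block = idx % dataset.block_size
    # Resolve _get_block's result before indexing into the block
    block = blocking_wait(dataset._get_block(block_id))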

for i, (ex, src) in enumerate(zip(examples, sources, strict=True)):
    tokens = ex.tokens.tolist()
    lw = ex.loss_weight
    pct_masked = float((lw == 0).sum()) / len(lw) * 100


tokens = ex.tokens.array.tolist()
lw = ex.loss_weight.array
pct_masked = float((lw == 0).sum()) / lw.size * 100

At least when I ran, LmExample.tokens and LmExample.loss_weight are Haliax NamedArrays, not raw JAX/numpy arrays. I needed .array to unwrap to the underlying JAX array before calling .tolist(). And JAX arrays don't support len() so I used .size instead.

out.close()

if output:
    click.echo(f"Wrote {total_examples} examples to {output}")


Consider:

gcs_output = os.path.join(step_output_path, "debug", output)
fs = gcsfs.GCSFileSystem()
fs.put(output, gcs_output)
click.echo(f"Wrote {total_examples} examples to {gcs_output}")
click.echo(f"Download with: gsutil cp {gcs_output} .")

The output file was previously written to the Ray worker's ephemeral local disk and lost after the job exited. This uploads to GCS under {step_output_path}/debug/, co-located with the training run's artifacts and collision-free via Marin's {name}-{hash} path convention.
