Conversation

Helw150 commented on Feb 20, 2026
- Add scripts/training/inspect_data.py — a CLI tool to inspect decoded training data at a given step on a Ray cluster, useful for debugging data pipelines, mixture weights, and masking
Claude finished @Helw150's task in 2m 1s

Review: Data Inspection Tool

This PR adds a CLI tool (scripts/debug/inspect_data.py) for inspecting decoded training data.

Specification check: This PR is 587 lines with no linked specification or design doc.
scripts/debug/inspect_data.py (Outdated)
config = _resolve_cluster_config(cluster)

parts = ["python", "scripts/training/inspect_data.py", experiment]
Bug: This path doesn't match the actual file location. The file is at scripts/debug/inspect_data.py, so the Ray job will fail to find the entrypoint.
| parts = ["python", "scripts/training/inspect_data.py", experiment] | |
| parts = ["python", "scripts/debug/inspect_data.py", experiment] |
scripts/debug/inspect_data.py (Outdated)
text = examples[doc_idx]["text"]
total = len(text.split("\n")) * 2
scroll_offset = min(scroll_offset + body_h, max(0, total))
This rough estimate (newlines * 2) diverges from the actual wrapping logic in draw() which uses textwrap.wrap(line, w - 1). Page Down can overshoot or undershoot the real content. Consider computing the wrapped line count from the same textwrap logic used in draw, or caching it.
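A minimal sketch of that suggestion, not code from the PR: the helper name wrapped_line_count is hypothetical, and it assumes draw() wraps each line with textwrap.wrap(line, w - 1) as quoted above.

import textwrap

def wrapped_line_count(text: str, w: int) -> int:
    # Count display rows with the same wrapping draw() uses, so Page Down
    # scrolls against the real rendered height instead of a rough estimate.
    total = 0
    for line in text.split("\n"):
        # textwrap.wrap returns [] for an empty line, but the viewer still
        # renders one blank row for it.
        total += max(1, len(textwrap.wrap(line, w - 1)))
    return total

# In the Page Down handler, replacing the newline-count estimate:
# total = wrapped_line_count(examples[doc_idx]["text"], w)
# scroll_offset = min(scroll_offset + body_h, max(0, total))

Caching the result per (doc_idx, width) would keep paging cheap for long documents.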
XenonMolecule left a comment
Overall this is great!!! And I LOVE the interactive viewer. I hit a few errors when I ran the non-interactive mode on my data, so I just wanted to flag the changes I had to make to get it to work, but all around it was a super useful tool!
# When submitted as a Ray job, the script runs on the cluster without --cluster.
# Detect this via RAY_JOB_ID which Ray sets automatically for submitted jobs.
on_cluster_node = os.environ.get("RAY_JOB_ID") is not None
Is this always set when we run on the cluster? When I launched with

uv run scripts/debug/inspect_data.py experiments/exp_qwen3_0_6b_rephraser_sft.py \
    --step 736 --cluster us-central1 -o step_736.jsonl

this line was causing me trouble. I fixed it by changing the check to

on_cluster_node = os.environ.get("RAY_JOB_ID") is not None or os.environ.get("MARIN_PREFIX") is not None
for idx in indices:
    block_id = idx // dataset.block_size
    index_within_block = idx % dataset.block_size
    block = dataset._get_block(block_id)
Maybe this should be:
blocking_wait(dataset._get_block(block_id)) instead of dataset._get_block(block_id)
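A sketch of the quoted loop with that change applied; it assumes blocking_wait is the existing helper used elsewhere in the codebase to resolve _get_block's result:

for idx in indices:
    block_id = idx // dataset.block_size
    index_within_block = idx % dataset.block_size
    # Resolve the (apparently async) block fetch before indexing into it.
    block = blocking_wait(dataset._get_block(block_id))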
for i, (ex, src) in enumerate(zip(examples, sources, strict=True)):
    tokens = ex.tokens.tolist()
    lw = ex.loss_weight
    pct_masked = float((lw == 0).sum()) / len(lw) * 100
tokens = ex.tokens.array.tolist()
lw = ex.loss_weight.array
pct_masked = float((lw == 0).sum()) / lw.size * 100

At least when I ran it, LmExample.tokens and LmExample.loss_weight are Haliax NamedArrays, not raw JAX/numpy arrays. I needed .array to unwrap to the underlying JAX array before calling .tolist(). And JAX arrays don't support len(), so I used .size instead.
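A defensive variant (a sketch, not the PR's code; the _unwrap helper is hypothetical) that works whether those fields arrive as Haliax NamedArrays or as plain arrays:

def _unwrap(x):
    # Haliax NamedArrays expose the underlying JAX array as .array;
    # plain JAX/numpy arrays pass through unchanged.
    return getattr(x, "array", x)

tokens = _unwrap(ex.tokens).tolist()
lw = _unwrap(ex.loss_weight)
pct_masked = float((lw == 0).sum()) / lw.size * 100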
out.close()

if output:
    click.echo(f"Wrote {total_examples} examples to {output}")
Consider:

import gcsfs  # assuming it isn't already imported in the script

gcs_output = os.path.join(step_output_path, "debug", output)
fs = gcsfs.GCSFileSystem()
fs.put(output, gcs_output)
click.echo(f"Wrote {total_examples} examples to {gcs_output}")
click.echo(f"Download with: gsutil cp {gcs_output} .")
The output file was previously written to the Ray worker's ephemeral local disk and lost after the job exited. This uploads it to GCS under {step_output_path}/debug/, co-located with the training run's artifacts and collision-free via Marin's {name}-{hash} path convention.