Add Nemotron Pretraining Code v1 to pretraining dataset registry by dlwh · Pull Request #2751 · marin-community/marin

dlwh · 2026-02-11T21:50:30Z

This adds wiring for nvidia/Nemotron-Pretraining-Code-v1 in the pretraining dataset registry so it can be downloaded and tokenized through the standard experiments/pretraining_datasets CLI.

The new dataset entry targets the Synthetic-Code config (the only config with a text field) and excludes the metadata-only config by restricting HF download globs to Synthetic-Code/*.parquet plus dataset docs.
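To make the glob restriction concrete, here is a minimal sketch using only the standard library. The `Synthetic-Code/*.parquet` pattern comes from the description above; the `*.md` pattern standing in for "dataset docs", the file listing, and the `Metadata-Config` directory name are assumptions made for illustration, not the actual repository layout or the registry's own matching code.

```python
from fnmatch import fnmatchcase

# "Synthetic-Code/*.parquet" is taken from the PR description.
# "*.md" is an ASSUMED stand-in for "dataset docs".
ALLOWED_GLOBS = ["Synthetic-Code/*.parquet", "*.md"]


def is_selected(path, globs=ALLOWED_GLOBS):
    """Return True if the path matches at least one allowed glob."""
    return any(fnmatchcase(path, pattern) for pattern in globs)


# HYPOTHETICAL repository listing; the real file names may differ.
repo_files = [
    "README.md",
    "Synthetic-Code/part-00000.parquet",
    "Metadata-Config/part-00000.parquet",
]

# Keep only the files the globs allow; the metadata-only parquet is dropped.
selected = [path for path in repo_files if is_selected(path)]
print(selected)
```

Note that `fnmatch`-style `*` also matches `/`, so a real downloader may apply stricter path semantics than this sketch does.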

Changes:

add downloads["nemotron_pretraining_code_v1"] in experiments/pretraining_datasets/simple.py
add tokenized["nemotron_pretraining_code_v1"] in experiments/pretraining_datasets/simple.py
register nemotron_pretraining_code_v1 in experiments/pretraining_datasets/__init__.py
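The shape of these three changes can be sketched with toy stand-ins. Everything below is hypothetical: `DownloadStep`, `TokenizeStep`, and `get_steps_sketch` are illustrative names, not the types or functions in experiments/pretraining_datasets, and the `*.md` docs glob is an assumption. Only the dataset ID, the registry key, the `Synthetic-Code` subdirectory, and the parquet glob come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadStep:  # hypothetical stand-in, not the repository's type
    hf_dataset_id: str
    hf_urls_glob: tuple


@dataclass(frozen=True)
class TokenizeStep:  # hypothetical stand-in, not the repository's type
    source: DownloadStep
    subdir: str
    text_key: str = "text"


# Toy registries keyed by dataset name, mirroring downloads[...] / tokenized[...].
downloads = {}
tokenized = {}

NAME = "nemotron_pretraining_code_v1"

downloads[NAME] = DownloadStep(
    hf_dataset_id="nvidia/Nemotron-Pretraining-Code-v1",
    hf_urls_glob=("Synthetic-Code/*.parquet", "*.md"),  # "*.md" is assumed
)
# Tokenization reads only the Synthetic-Code subdirectory of the download.
tokenized[NAME] = TokenizeStep(source=downloads[NAME], subdir="Synthetic-Code")


def get_steps_sketch(names, download=True, tokenize=True):
    """Collect the requested steps for each registered dataset name."""
    steps = []
    for name in names:
        if download:
            steps.append(downloads[name])
        if tokenize:
            steps.append(tokenized[name])
    return steps


steps = get_steps_sketch([NAME])
```

Keeping the download and tokenize entries under the same key is what lets a single name resolve to both steps.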

Validation:

uv run experiments/pretraining_datasets/main.py list
./infra/pre-commit.py experiments/pretraining_datasets/simple.py experiments/pretraining_datasets/__init__.py
uv run python - <<'PY' ... get_steps(['nemotron_pretraining_code_v1'], download=True, tokenize=True) ... PY

claude · 2026-02-19T06:57:01Z

Claude finished @dlwh's task in 43s

Review: Add Nemotron Pretraining Code v1

Adds download and tokenize wiring for nvidia/Nemotron-Pretraining-Code-v1 to the pretraining dataset registry. The change follows existing patterns correctly and fulfills its stated objectives.

Gather context and read changed files
Verify patterns against existing dataset entries
Check hf_urls_glob usage and text_key default

Findings: No issues found.

The implementation is consistent with other simple dataset entries:

hf_urls_glob restricts downloads to Synthetic-Code/*.parquet plus docs, which correctly avoids the metadata-only config.
.cd("Synthetic-Code") scopes the tokenizer to the right subdirectory.
The default TextLmDatasetFormat(text_key="text") matches the Synthetic-Code schema.
The __init__.py registry entry follows the exact same structure as the surrounding entries.
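The `text_key` point above can be checked mechanically. The following is a minimal sketch, assuming records are available as plain dictionaries; the sample rows and the `missing_text_key` helper are hypothetical and do not reflect the dataset's actual contents or any repository utility.

```python
def missing_text_key(records, text_key="text"):
    """Return the indices of records without a non-empty string under text_key."""
    return [
        index
        for index, record in enumerate(records)
        if not isinstance(record.get(text_key), str) or not record[text_key]
    ]


# HYPOTHETICAL rows shaped like a config that carries a text field.
text_rows = [
    {"text": "def add(a, b):\n    return a + b\n"},
    {"text": "print('hello')\n"},
]

# HYPOTHETICAL rows shaped like a metadata-only config (no text field).
metadata_rows = [
    {"repo": "example/repo", "path": "src/main.py"},
    {"repo": "example/other", "path": "lib/util.py"},
]

print(missing_text_key(text_rows))      # no indices flagged
print(missing_text_key(metadata_rows))  # every index flagged
```

A check like this illustrates why the metadata-only config is excluded: with the default `text_key="text"`, none of its rows would yield tokenizable text.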

LGTM.

Add Nemotron Pretraining Code v1 dataset wiring

cf5c1ed

dlwh marked this pull request as ready for review February 19, 2026 06:56

dlwh wants to merge 1 commit into main from codex/add-nemotron-pretraining-code-v1
