Add Nemotron Pretraining Code v1 to pretraining dataset registry #2751

Open: dlwh wants to merge 1 commit into main from codex/add-nemotron-pretraining-code-v1
dlwh (Member) commented Feb 11, 2026

This adds wiring for nvidia/Nemotron-Pretraining-Code-v1 in the pretraining dataset registry so it can be downloaded and tokenized through the standard experiments/pretraining_datasets CLI.

The new dataset entry targets the Synthetic-Code config (the only config with a text field) and excludes the metadata-only config by restricting HF download globs to Synthetic-Code/*.parquet plus dataset docs.
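To illustrate the glob restriction described above, here is a minimal, self-contained sketch of how such patterns would keep the Synthetic-Code parquet shards and docs while skipping everything else. It is not the registry's actual download code: the `*.md` docs pattern, the `select_files` helper, and the file listing (including the `Metadata-Only/` directory name) are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Assumed patterns: the PR text only says "Synthetic-Code/*.parquet plus dataset docs";
# "*.md" is a stand-in for whatever docs pattern the entry really uses.
HF_URLS_GLOB = ["Synthetic-Code/*.parquet", "*.md"]


def select_files(paths, patterns):
    """Return the paths that match at least one glob pattern (hypothetical helper)."""
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


# Hypothetical repository listing, for illustration only.
repo_files = [
    "README.md",
    "Synthetic-Code/part-00000.parquet",
    "Synthetic-Code/part-00001.parquet",
    "Metadata-Only/part-00000.parquet",
]

selected = select_files(repo_files, HF_URLS_GLOB)
print(selected)
```

In this sketch the metadata-only shard is never selected, which is the behaviour the glob restriction is meant to provide.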

Changes:

  • add downloads["nemotron_pretraining_code_v1"] in experiments/pretraining_datasets/simple.py
  • add tokenized["nemotron_pretraining_code_v1"] in experiments/pretraining_datasets/simple.py
  • register nemotron_pretraining_code_v1 in experiments/pretraining_datasets/__init__.py
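The three changes above can be pictured with a small stand-alone sketch. It mirrors only the shape of the wiring (one download entry, one tokenize entry, one registry entry under the same key). `DownloadSpec`, `TokenizeSpec`, `REGISTRY`, and the `*.md` docs pattern are hypothetical stand-ins, not the types or values used in `experiments/pretraining_datasets`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadSpec:  # hypothetical stand-in, not the repository's type
    hf_dataset_id: str
    hf_urls_glob: tuple[str, ...]


@dataclass(frozen=True)
class TokenizeSpec:  # hypothetical stand-in, not the repository's type
    source: DownloadSpec
    subdir: str
    text_key: str = "text"


NAME = "nemotron_pretraining_code_v1"
downloads = {}
tokenized = {}

# 1. Download entry, restricted to the config that carries a text field.
downloads[NAME] = DownloadSpec(
    hf_dataset_id="nvidia/Nemotron-Pretraining-Code-v1",
    hf_urls_glob=("Synthetic-Code/*.parquet", "*.md"),  # "*.md" is assumed
)

# 2. Tokenize entry, scoped to the Synthetic-Code subdirectory of the download.
tokenized[NAME] = TokenizeSpec(source=downloads[NAME], subdir="Synthetic-Code")

# 3. Registry entry that exposes both steps under one dataset name.
REGISTRY = {NAME: {"download": downloads[NAME], "tokenize": tokenized[NAME]}}
```

Keeping the same key in all three places is what lets a CLI look up a dataset by a single name and resolve both steps from it.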

Validation:

  • uv run experiments/pretraining_datasets/main.py list
  • ./infra/pre-commit.py experiments/pretraining_datasets/simple.py experiments/pretraining_datasets/__init__.py
  • uv run python - <<'PY' ... get_steps(['nemotron_pretraining_code_v1'], download=True, tokenize=True) ... PY

dlwh marked this pull request as ready for review on February 19, 2026 06:56

claude bot (Contributor) commented Feb 19, 2026

Claude finished @dlwh's task in 43s.


Review: Add Nemotron Pretraining Code v1

Adds download and tokenize wiring for nvidia/Nemotron-Pretraining-Code-v1 to the pretraining dataset registry. The change follows existing patterns correctly and fulfills its stated objectives.

  • Gather context and read changed files
  • Verify patterns against existing dataset entries
  • Check hf_urls_glob usage and text_key default

Findings: No issues found.

The implementation is consistent with other simple dataset entries:

  • hf_urls_glob restricts downloads to Synthetic-Code/*.parquet plus docs, which correctly avoids the metadata-only config.
  • .cd("Synthetic-Code") scopes the tokenizer to the right subdirectory.
  • The default TextLmDatasetFormat(text_key="text") matches the Synthetic-Code schema.
  • The __init__.py registry entry follows the exact same structure as the surrounding entries.
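As a rough illustration of the subdirectory scoping and `text_key` points above, the following sketch shows why the key has to match the record schema. `scope_to_config` and `extract_text` are hypothetical helpers written for this example, and the sample records and paths are made up. Nothing here is the tokenizer's real code path.

```python
from pathlib import PurePosixPath


def scope_to_config(download_root, config):
    """Join a download root and a config subdirectory (hypothetical helper)."""
    return str(PurePosixPath(download_root) / config)


def extract_text(record, text_key="text"):
    """Return the text field of a record, failing loudly if the key is absent."""
    if text_key not in record:
        raise KeyError(f"record has no {text_key!r} field; keys: {sorted(record)}")
    return record[text_key]


# Made-up records: one shaped like a text-bearing row, one metadata-only.
code_record = {"text": "def add(a, b):\n    return a + b\n"}
metadata_record = {"repo": "example/repo", "path": "src/add.py"}

scoped = scope_to_config("raw/nemotron_pretraining_code_v1", "Synthetic-Code")
text = extract_text(code_record)

try:
    extract_text(metadata_record)
    metadata_has_text = True
except KeyError:
    metadata_has_text = False
```

A metadata-only record fails the lookup, which is why scoping to the text-bearing config and excluding the other one from the download both matter.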

LGTM.
