First Attempt at DPO #2460

Open
ahmeda14960 wants to merge 42 commits into main from dpo_claude_opus

Conversation

@ahmeda14960
Contributor

I had Codex write a DPO implementation and then had Claude double-check / simplify it. From looking at the DPO_CLAUDE.md file, nothing seems obviously wrong to me, though this reference model + policy model business is quite strange
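(For context on the policy/reference pairing: the reference model is a frozen copy of the starting policy, only the policy is trained, and the reference log-probs anchor DPO's implicit reward. A minimal plain-JAX sketch of the standard DPO loss, for orientation only and not the code in this PR:)

import jax
import jax.numpy as jnp

# logp_* are per-example sequence log-probabilities, shape [batch]
def dpo_loss(logp_pi_chosen, logp_pi_rejected,      # from the trainable policy model
             logp_ref_chosen, logp_ref_rejected,    # from the frozen reference model
             beta: float = 0.1):
    delta_pi = logp_pi_chosen - logp_pi_rejected    # policy's preference margin
    delta_ref = logp_ref_chosen - logp_ref_rejected # reference's preference margin
    # Bradley-Terry objective: push the policy margin above the reference margin
    return -jnp.mean(jax.nn.log_sigmoid(beta * (delta_pi - delta_ref)))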

@ahmeda14960 ahmeda14960 requested review from Copilot and dlwh and removed request for Copilot January 24, 2026 20:09
Copilot AI review requested due to automatic review settings January 24, 2026 20:15
Copilot AI left a comment

Pull request overview

This PR implements Direct Preference Optimization (DPO) training support for Levanter/Marin. The implementation was generated by Codex and reviewed/simplified by Claude, with comprehensive documentation provided in DPO_claude.md explaining the rationale for all changes.

Changes:

  • Added complete DPO training implementation with policy and reference models
  • Extended data processing to handle preference chat datasets
  • Added minimal but necessary Haliax changes to handle NamedArray as leaf nodes in tree operations
  • Added comprehensive test coverage for all new functionality

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Summary per file:

  • lib/levanter/src/levanter/main/train_dpo.py: Core DPO training loop with loss computation and model management (515 new lines)
  • lib/levanter/src/levanter/data/text.py: PreferenceChatProcessor and PreferencePairDataset for handling preference data
  • lib/levanter/src/levanter/data/packing.py: Added "drop" slice_strategy for sequences exceeding max length
  • lib/levanter/src/levanter/trainer_state.py: Uses is_leaf pattern with NamedArray for correct partition/combine operations
  • lib/haliax/src/haliax/quantization.py: Treats NamedArray as leaf in partition/apply_updates operations
  • lib/haliax/src/haliax/partitioning.py: Handles NamedArray with array=None and batch_dim from vmap
  • lib/haliax/src/haliax/nn/scan.py: Adds auto_sharded after vmap for memory efficiency with stacked layers
  • lib/levanter/tests/test_dpo.py: Comprehensive test suite (372 lines) covering all DPO functionality
  • lib/marin/src/marin/training/training.py: Marin integration for DPO training via TrainDpoOnPodConfig
  • lib/marin/src/marin/transform/conversation/transform_preference_data.py: Preference dataset transformation with fsspec support
  • experiments/exp2101_dpo_ultrafeedback.py: Example experiment using DPO on the Ultrafeedback dataset
  • lib/levanter/config/dpo_ultrafeedback_llama3_8b.yaml: Production-ready DPO configuration
  • lib/levanter/config/dpo_tiny_gpt2.yaml: Minimal DPO test configuration

tokenized=tokenized_preferences,
model_config=llama_8b,
dpo_config=dpo_config,
tags=["ultrafeedback", "llama3", "simpo"],
Copilot AI Jan 24, 2026

The tags list includes "simpo" but this appears to be a DPO (Direct Preference Optimization) implementation, not SimPO (Simple Preference Optimization). If this is intentional (perhaps planning to implement SimPO later), consider adding a comment to clarify. Otherwise, remove the "simpo" tag.

Suggested change
tags=["ultrafeedback", "llama3", "simpo"],
tags=["ultrafeedback", "llama3"],

pretraining_data = dataclasses.replace(pretraining_data, permutation_type="feistel")
vocab_size = _get_vocab_size(pretraining_data)

if len(name) > 64:
Member

can we extract a helper for this since we use it for training too

Contributor Author

done
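(A hypothetical shape for such a shared helper; the name and truncation scheme here are illustrative, not the repo's actual code:)

import hashlib

def clamp_run_name(name: str, max_len: int = 64) -> str:
    # keep a recognizable prefix plus a short hash so truncated names stay unique
    if len(name) <= max_len:
        return name
    digest = hashlib.sha256(name.encode()).hexdigest()[:8]
    return f"{name[: max_len - 9]}-{digest}"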

# this happens when we filter out params for things like lora.
# could use eqx.partition to avoid this, but eh
return named
if getattr(named.array, "batch_dim", None) is not None:
Member

i hate this. i need a minimum reproducer so i can make this go away

Member

can't we revert this

Contributor Author

yes done

weight_decay: float = 0.0
warmup: float = 0.03
cooldown: float | None = None
lr_schedule: str = "linear"
Member

is this what people do for dpo

Contributor Author

linear yes; I'm not sure about this particular warmup value, but it's probably good to keep it close to train_lm

Contributor Author

removed warmup


Note that trainer.id and the RUN_ID env variable take precedence, in that order.
"""
allow_out_of_region: tuple[str, ...] = ()
Member

let's not allow this until we really need it

Contributor Author

I was inheriting from TrainLMPod... do we want to get rid of that too? I can do that

Member

sure

dlwh added a commit that referenced this pull request Jan 28, 2026
together with #2463 should avoid a lot of the noisy changes in
#2460/#2462
ahmeda14960 and others added 8 commits January 30, 2026 23:23
- Update train_dpo.py imports to use LmDataConfig instead of SingleDatasetLMConfig
- Migrate to components-based data config structure
- Replace text.py with text/ package structure (from simpo)
- Add preference.py with DPO-specific classes
- Update DPO YAML configs to use components: structure
- Merge validation split functions into single _build_validation_split
- Copy updated Levanter main scripts from simpo (train_lm.py, eval_lm.py, etc.)
- Copy updated marin tokenize files from simpo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Copy all config files from simpo (updated to components: structure)
- Copy updated source files from simpo (trainer_state.py, optim/, etc.)
- Add EpochDataset class to dataset.py for DPO training
- Update text/__init__.py exports for preference functions
- Add SimPO config files from simpo branch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Keep preference format handling in datasets.py and formats.py
- Keep DPO exports in text/__init__.py
- Accept main's partitioning.py changes (use axis_names)
- Restore EpochDataset class in dataset.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… branch

- tokenizer: marin-community/marin-tokenizer
- train_batch_size: 128, num_train_steps: 2150
- learning_rate: 5e-7, lr_schedule: cosine, warmup: 0.1
- beta: 0.01
- Add both train and validation components with proper cache dirs
- Use GCS model paths for reference_model_path and initialize_from_hf
- validation_split_fraction: null (uses separate validation component)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ahmeda14960 ahmeda14960 requested a review from dlwh February 2, 2026 07:13
@@ -0,0 +1,96 @@
# Copyright 2025 The Marin Authors
Member

in keeping with new policy i'm not gonna review this unless you want me to

Member

that said, can you $AGENT up a doc that explains how to run dpo in the Levanter and Marin settings (probably two docs, one for marin and one for levanter)


)
else:
# Check for preference format (imported lazily to avoid circular imports)
from .preference import PreferenceChatLmDatasetFormat, dataset_for_preference_format
Member

preference datasets should not be part of text datasets. they are a different structure (two sequences instead of one) and so need to be a different type.
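(A minimal sketch of what such a separate preference type could look like; field names and array types are illustrative, not the PR's actual class:)

from dataclasses import dataclass
import jax.numpy as jnp

@dataclass(frozen=True)
class PreferenceExample:
    # a preference example is inherently a pair of sequences, so it gets its own
    # container rather than reusing the single-sequence text example type
    chosen_tokens: jnp.ndarray     # [position] token ids for the preferred completion
    chosen_loss_mask: jnp.ndarray  # [position] 1.0 on completion tokens, 0.0 on prompt
    rejected_tokens: jnp.ndarray
    rejected_loss_mask: jnp.ndarray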

return model.policy if isinstance(model, DpoModel) else model


def _bool_tree_like(tree, value: bool):
Member

in theory any tree prefix should work so you should be able to return value directly but maybe i'm wrong

Contributor Author

TIL thank you! looks a lot cleaner
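(Sketch of the tree-prefix point, assuming Equinox-style filtering, which the repo already uses elsewhere: filter_spec may be a prefix of the pytree, so a bare bool stands in for a whole tree of bools. Illustrative, not the PR's code:)

import equinox as eqx
import jax.numpy as jnp

params = {"w": jnp.ones((4, 4)), "b": jnp.zeros((4,))}

# instead of eqx.partition(params, _bool_tree_like(params, True)) ...
trainable, rest = eqx.partition(params, True)  # a bare True already matches every leaf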

*,
beta: float,
) -> tuple[jnp.ndarray, dict[str, Metric]]:
if isinstance(delta_pi, hax.NamedArray) or isinstance(delta_ref, hax.NamedArray):
Member

do we need to make this this defensive

Contributor Author

im just gonna make everything named array

Comment on lines 87 to 89
nll = model.compute_next_token_loss(example, reduction=None, reduction_axis=(), key=key)
Pos = example.tokens.resolve_axis("position")
return -hax.sum(nll, axis=Pos)
Member

actually why aren't we just doing

Suggested change
nll = model.compute_next_token_loss(example, reduction=None, reduction_axis=(), key=key)
Pos = example.tokens.resolve_axis("position")
return -hax.sum(nll, axis=Pos)
nll = model.compute_next_token_loss(example, reduction=hax.sum, reduction_axis="position", key=key)
return -nll

Contributor Author

Done

if cache is None:
raise ValueError(f"No training cache available for component {name}.")

if not isinstance(component.format, PreferenceChatLmDatasetFormat):
Member

we should change the code until this isn't possible (i.e. not a textdataset)

Contributor Author

done

Member

did you?

loss, metrics = dpo_loss_from_logps(delta_pi, delta_ref, beta=config.beta)
chosen_reward = (logp_pi_chosen - logp_ref_chosen) * config.beta
rejected_reward = (logp_pi_rejected - logp_ref_rejected) * config.beta
if isinstance(chosen_reward, hax.NamedArray):
Member

can we standardize on named or unnamed

Contributor Author

standardizing on named
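(A minimal sketch of keeping the reward metrics named end to end, assuming the log-probs are haliax NamedArrays over a "batch" axis and that NamedArray supports elementwise comparison and astype like a jax array. Illustrative only, not the PR's code:)

import haliax as hax
import jax.numpy as jnp

def dpo_reward_metrics(logp_pi_chosen, logp_pi_rejected,
                       logp_ref_chosen, logp_ref_rejected, *, beta: float):
    # implicit DPO rewards, kept as NamedArrays over the batch axis throughout
    chosen_reward = beta * (logp_pi_chosen - logp_ref_chosen)
    rejected_reward = beta * (logp_pi_rejected - logp_ref_rejected)
    margin = chosen_reward - rejected_reward
    return {
        "reward/chosen": hax.mean(chosen_reward, axis="batch"),
        "reward/rejected": hax.mean(rejected_reward, axis="batch"),
        "reward/margin": hax.mean(margin, axis="batch"),
        # fraction of pairs the policy ranks correctly
        "reward/accuracy": hax.mean((margin > 0).astype(jnp.float32), axis="batch"),
    }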

Comment on lines 407 to 408
state = dataclasses.replace(state, model=None)
gc.collect()
Member

this is wrong and not preemption/resume safe. if you load a model from trainer.initial_state you need to stick with it unless step == 0

Contributor Author

sorry not sure why this was added
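(A hypothetical helper capturing the reviewer's point; the step and model field names follow the shape of a levanter-style trainer state and are assumptions here:)

import dataclasses

def maybe_install_fresh_model(state, freshly_loaded_model):
    """Swap in a freshly loaded policy/HF model only on a brand-new run.

    If state.step > 0 the run resumed from a checkpoint, so the checkpointed model
    must be kept; dropping it (model=None) and reloading would discard progress and
    break preemption/resume.
    """
    if int(state.step) == 0:
        return dataclasses.replace(state, model=freshly_loaded_model)
    return state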

MONITOR_SIMPO.md Outdated
Member

put in .agents/ or docs/ please

Contributor Author

deleted

@ahmeda14960 ahmeda14960 requested a review from dlwh February 8, 2026 17:57
Comment on lines +278 to +282
# Check for preference format (imported lazily to avoid circular imports)
from .preference import PreferenceChatLmDatasetFormat, preprocessor_for_preference_format

if isinstance(format, PreferenceChatLmDatasetFormat):
return preprocessor_for_preference_format(format, tokenizer) # type: ignore
Member

remove


@dlwh dlwh left a comment

Codex review (by Codex): merged latest main into this DPO branch and resolved the defaults.py conflict by keeping the DPO defaults wiring while aligning tokenizer vocab-size resolution with current main utilities. I recommend re-running the full CI matrix due to the large main-sync delta.
