The domain column names in `ratios.csv` and the metric column names in `metrics.csv` can be anything — Olmix derives them automatically from the CSV headers. The following columns are treated as metadata and skipped during fitting: `run` (or `run_id`) — the required ID column used to join the two files; `name` — an optional human-readable label; `index` — an optional sequential index; and any unnamed row-index columns (e.g., added by pandas on export). Only `run` or `run_id` is required.
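The join described above can be sketched with the standard library: the two files are matched on `run`, the metadata columns are dropped, and every remaining column is treated as a domain ratio or a metric. The file contents below are hypothetical, for illustration only:

```python
import csv
import io

# Hypothetical CSV contents; real files are exported from your swarm runs.
ratios_csv = """run,name,wikipedia,arxiv
r0,baseline,0.7,0.3
r1,more-arxiv,0.4,0.6
"""
metrics_csv = """run,name,mmlu,hellaswag
r0,baseline,0.31,0.45
r1,more-arxiv,0.33,0.44
"""

# Metadata columns skipped during fitting (per the list above).
METADATA = {"run", "run_id", "name", "index", ""}

def rows_by_run(text):
    return {r["run"]: r for r in csv.DictReader(io.StringIO(text))}

ratios, metrics = rows_by_run(ratios_csv), rows_by_run(metrics_csv)

# Inner-join on the run ID; non-metadata columns become domains/metrics.
joined = {
    run: (
        {k: float(v) for k, v in ratios[run].items() if k not in METADATA},
        {k: float(v) for k, v in metrics[run].items() if k not in METADATA},
    )
    for run in ratios.keys() & metrics.keys()
}
```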
### How to run
`olmix fit` is configured via a YAML file that points to your `ratios.csv` and `metrics.csv`. Run it with:
```bash
olmix fit --config configs/examples/fit/example.yaml --output-dir output/my_fit
```
| Flag | Description |
|------|-------------|
| `--config` | Path to the YAML fit configuration file |
| `--output-dir` | Directory for saving fit outputs |
See [`configs/examples/fit/example.yaml`](configs/examples/fit/example.yaml) for a full example. The config has these sections:
```yaml
swarm:
  # ...
```
| Field | Description |
|-------|-------------|
| `relative_sizes` | Fractional weight of each domain in the natural corpus (should sum to ~1.0). Defines the prior distribution used as the KL regularization target in the proposer. |
| `token_counts` | Absolute token count per domain. Used for repetition constraint. |
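For concreteness, here is a hedged sketch of what a `priors` block might look like. The domain names and token counts are hypothetical; note that `relative_sizes` sums to 1.0 and `total_tokens` equals the sum of `token_counts`:

```yaml
priors:
  relative_sizes:                    # natural corpus fractions, sum ≈ 1.0
    dclm:science_math_and_technology: 0.60
    wikipedia: 0.25
    arxiv: 0.15
  token_counts:                      # absolute tokens available per domain
    dclm:science_math_and_technology: 120000000000
    wikipedia: 50000000000
    arxiv: 30000000000
  total_tokens: 200000000000         # optional; sum of token_counts
```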
#### `eval` section
---
## Part 2: Generating swarm mixtures
`olmix generate` samples a swarm of mixtures from a `GenerationConfig` YAML and writes each one as a `LaunchConfig` file used to train the proxy models. A key capability of `olmix generate` is **mixture reuse**: freezing the relative topic weights of already-optimized domains so that only the remaining domains vary across the swarm. See [`configs/examples/generate/example.yaml`](configs/examples/generate/example.yaml) for a basic `GenerationConfig` and [`configs/examples/generate/partial_mixture_reuse.yaml`](configs/examples/generate/partial_mixture_reuse.yaml) for a partial mixture reuse example.
### Step 0: Compute priors (token counts)
Before generating mixes, set the priors for the data paths in your config. There are two fields: **relative_sizes**, used as the Dirichlet prior, and **token_counts**, used to enforce repetition constraints on the swarm (by default, no data is repeated at the proxy-model scale). These priors can be set manually, or computed automatically with `olmix priors compute`, which uses the natural distribution and the actual sizes of the data paths.
Copy the output into your generation config's `priors:` section. Use `--output priors.yaml` to write to a file instead. Results are cached in `cache/` for subsequent runs; use `--no-cache` to force a fresh scan.
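The repetition constraint can be made concrete with a little arithmetic: a domain's maximum weight in one proxy run is roughly `repetition_factor × token_count / max_tokens`. Below is a minimal sketch with made-up numbers; it illustrates the constraint, not olmix's actual implementation:

```python
# Hypothetical priors; real values come from `olmix priors compute`.
token_counts = {"wikipedia": 5_000_000_000, "dclm": 400_000_000_000}
max_tokens = 10_000_000_000     # token budget per proxy run
repetition_factor = 1.0         # default: no data repeated at proxy scale

# Cap each domain's mixture weight so it never repeats its data.
max_weight = {
    domain: min(1.0, repetition_factor * count / max_tokens)
    for domain, count in token_counts.items()
}
# wikipedia can fill at most half of a 10B-token run; dclm is unconstrained.
```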
### Step 1: Generate candidate mixtures
Use `olmix generate` to sample mixture variants from a generation config. The `--base` flag provides a `LaunchConfig` template (infra, training, eval settings); each variant inherits from it and gets a unique sampled `mix` written into it.
This produces one self-contained `LaunchConfig` YAML per variant:
```
output/my_variants/
  example-swarm-a1b2c3d4-0000.yaml
  example-swarm-a1b2c3d4-0001.yaml
  ...
```
Inspect and edit these files before launching — this is where you have full control over what gets trained.
### Step 2: Launch a swarm
```bash
olmix launch run --variants output/my_variants/
```
This submits one training job per variant. Each job trains a proxy model on its mixture and logs eval metrics to W&B under a shared group ID. Use `--dry-run` to generate metadata without launching.
Once runs complete, export ratios and metrics to CSV files (e.g. from W&B), then fit using the workflow in [Part 1](#part-1-mixture-optimization-from-csv-data).
### GenerationConfig reference
```yaml
name: my-swarm
data: # What data sources exist and how they're organized
priors: # Natural token distribution at the leaf level (from olmix priors compute)
swarm: # Sampling parameters
max_tokens: # Token budget per proxy run
```
#### `data`
`data.sources` lists data pools in a hierarchy: **source → topic → quality**. Each source specifies exactly one of `paths` (flat source), `topics`, or `quality`.
An optional `weight` field can appear on any topic; it pins that topic's share of the source's allocation (pinned values within a source should sum to ~1.0). Anything without a `weight` is sampled from the Dirichlet and varies freely across runs. This is the **mixture reuse** pattern: freeze the existing ratios and recompute only the affected domains. Note that "source" and "topic" are relative terms here; for example, the aggregated virtual domain `existing` is a source, while `wikipedia` is a topic within it:
```yaml
data:
  sources:
    - name: existing
      topics:
        - name: dclm:science_math_and_technology
          paths: [...]
          weight: 0.55   # frozen from prior optimization
        - name: dclm:software_development
          paths: [...]
          weight: 0.30   # frozen
        - name: dclm:entertainment
          paths: [...]
          weight: 0.1    # frozen
        - name: wikipedia
          paths: [...]
          weight: 0.05   # frozen
    - name: stack-edu
      topics:
        - name: Python
          paths: [...]   # no weight → sampled freely in each variant
        - name: Java
          paths: [...]   # free to vary
```
For this example, the domains to recompute are `existing`, `stack-edu:Python`, and `stack-edu:Java`.
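Since a leaf's final weight is the product of its source-level share and its pinned topic weight, the frozen ratios inside `existing` survive whatever source-level split the sampler draws. A small sketch of that bookkeeping (the sampled source-level weights below are made up):

```python
# Pinned topic weights inside the `existing` source (from the example above).
frozen = {
    "dclm:science_math_and_technology": 0.55,
    "dclm:software_development": 0.30,
    "dclm:entertainment": 0.10,
    "wikipedia": 0.05,
}

# Hypothetical source-level weights drawn for one variant.
source_weights = {"existing": 0.80, "stack-edu:Python": 0.15, "stack-edu:Java": 0.05}

# Leaf weight = source share * pinned topic weight.
leaf_weights = {t: source_weights["existing"] * w for t, w in frozen.items()}
leaf_weights["stack-edu:Python"] = source_weights["stack-edu:Python"]
leaf_weights["stack-edu:Java"] = source_weights["stack-edu:Java"]

# The frozen 0.55 : 0.30 : 0.10 : 0.05 ratios are preserved, and weights sum to 1.
```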
#### `priors`
Priors must be given at the **leaf level** (e.g. `dclm:science_math_and_technology`, not `dclm`). `relative_sizes` defines the Dirichlet prior center for free domains; `token_counts` enforces the repetition constraint (no domain is sampled past `repetition_factor` × its available data).
#### `swarm`
| Field | Description | Default |
|-------|-------------|---------|
| `variants` | Number of mixture variants to generate | `1` |
| `seed` | Random seed | `42` |
| `min_strength` / `max_strength` | Dirichlet concentration range. Low = diverse/extreme mixes; high = mixes near the prior | `0.1` / `5.0` |
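The strength range can be read as a Dirichlet concentration: alpha = strength × prior. A minimal sketch of that sampling using stdlib gamma variates (the prior here is hypothetical, and the actual olmix sampler is more involved):

```python
import random

random.seed(42)
prior = {"dclm": 0.7, "wikipedia": 0.2, "arxiv": 0.1}   # relative_sizes

def sample_mix(prior, strength):
    # Dirichlet(strength * prior) via normalized gamma draws.
    gammas = {d: random.gammavariate(strength * p, 1.0) for d, p in prior.items()}
    total = sum(gammas.values())
    return {d: g / total for d, g in gammas.items()}

diverse = sample_mix(prior, 0.1)    # min_strength: extreme, diverse mixes
faithful = sample_mix(prior, 5.0)   # max_strength: mixes nearer the prior
```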
---

`configs/examples/fit/dclm_per_family.yaml`: a config for fitting one regression model per family, corresponding to RQ5, Table 3 (middle row), Section 3.3.3 in Olmix (https://arxiv.org/abs/2602.12237).