Commit f4df506: Merge pull request #28 from allenai/mayeec/dev
Simplify swarm generation
2 parents 7c8c73b + 64da8a1


71 files changed (+2673, -848 lines)

README.md

Lines changed: 85 additions & 133 deletions
@@ -43,20 +43,20 @@ Prepare two CSV files. Each row is one proxy run from your swarm, and the two fi

The domain column names in `ratios.csv` and the metric column names in `metrics.csv` can be anything — Olmix derives them automatically from the CSV headers. The following columns are treated as metadata and skipped during fitting: `run` (or `run_id`) — the required ID column used to join the two files; `name` — an optional human-readable label; `index` — an optional sequential index; and any unnamed row-index columns (e.g., added by pandas on export). Only `run` or `run_id` is required.

-### Fit config
+### How to run

-`olmix fit` is configured via a YAML file. Run it with:
+`olmix fit` is configured via a YAML file that references `ratios.csv` and `metrics.csv`. Run it with:

```bash
-olmix fit --config configs/fits/dclm_baseline.yaml --output-dir output/my_fit
+olmix fit --config configs/examples/fit/example.yaml --output-dir output/my_fit
```

| Flag | Description |
|------|-------------|
| `--config` | Path to the YAML fit configuration file |
| `--output-dir` | Directory for saving fit outputs |

-See [`configs/fits/dclm_baseline.yaml`](configs/fits/dclm_baseline.yaml) for a full example. The config has these sections:
+See [`configs/examples/fit/example.yaml`](configs/examples/fit/example.yaml) for a full example. The config has these sections:

```yaml
swarm:
@@ -114,7 +114,6 @@ Only `swarm` and `priors` are required. All other sections are optional and fall
|-------|-------------|
| `relative_sizes` | Fractional weight of each domain in the natural corpus (should sum to ~1.0). Defines the prior distribution used as the KL regularization target in the proposer. |
| `token_counts` | Absolute token count per domain. Used for the repetition constraint. |
-| `total_tokens` | (Optional) Total token budget across all domains, equal to the sum across `token_counts`. |

#### `eval` section

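As an aside on the two `priors` fields in the table above: in the default case, `relative_sizes` is just `token_counts` normalized to sum to 1. A minimal sketch of that relationship (illustrative only — `relative_sizes_from_counts` is a hypothetical helper, not part of olmix):

```python
def relative_sizes_from_counts(token_counts: dict) -> dict:
    """Normalize absolute per-domain token counts into fractional weights."""
    total = sum(token_counts.values())
    return {domain: count / total for domain, count in token_counts.items()}

# Toy counts standing in for real per-domain token counts
counts = {"wikipedia": 2_000, "arxiv": 1_000, "dclm": 1_000}
sizes = relative_sizes_from_counts(counts)

# The weights sum to 1.0 and preserve the counts' proportions
assert abs(sum(sizes.values()) - 1.0) < 1e-9
assert sizes["wikipedia"] == 0.5
```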
@@ -182,90 +181,28 @@ The key output is `opt_avg_all_metrics_*_optimal.json` — the single set of wei

---

-## Part 2: Launching swarms and fitting from W&B
+## Part 2: Generating swarm mixtures

-Once you're comfortable with the fitting workflow above, you can use olmix end-to-end: generate candidate mixtures, launch proxy training runs on Beaker, and fit directly from the W&B results.
-
-The workflow uses two separate configs:
-
-- **`GenerationConfig`** — controls how mixes are sampled (data sources, priors, swarm parameters, token budget). See [`configs/generations/example.yaml`](configs/generations/example.yaml).
-- **`LaunchConfig`** — controls how training runs are launched (infra, training hyperparams, eval, **mix**). See configs in [`configs/experiments/`](configs/experiments/).
-
-Every `LaunchConfig` requires an explicit top-level `mix` field that maps domain keys to weights and repetition factors. The `data.sources` section describes *what data exists*; the `mix` section describes *how much of each domain to use*.
-
-The `mix` supports two formats — **nested** (recommended for hand-written configs) and **flat** (used by generated configs). Both are equivalent; nested mixes are auto-flattened on load.
-
-**Nested format** — mirrors the source/topic/quality hierarchy. Weights at each level are multiplied to get the final leaf weight. `repetition_factor` is inherited from the nearest ancestor that sets it:
-
-```yaml
-mix:
-  dclm:
-    weight: 0.8
-    repetition_factor: 1.0
-    science_math_and_technology:
-      weight: 0.25
-      repetition_factor: 1.0
-    software_development:
-      weight: 0.625
-      repetition_factor: 1.0
-    education_and_jobs:
-      weight: 0.125
-      repetition_factor: 1.0
-  wikipedia:
-    weight: 0.1
-    repetition_factor: 2.0
-  arxiv:
-    weight: 0.1
-    repetition_factor: 1.5
-```
-
-For quality-level nesting:
-
-```yaml
-mix:
-  all_dressed:
-    weight: 0.98
-    repetition_factor: 1.0
-    science:
-      weight: 0.20
-      high: { weight: 0.70, repetition_factor: 1.0 }
-      med: { weight: 0.30, repetition_factor: 1.0 }
-    code:
-      weight: 0.50
-      high: { weight: 0.70, repetition_factor: 1.0 }
-      med: { weight: 0.30, repetition_factor: 1.0 }
-  arxiv:
-    weight: 0.02
-    repetition_factor: 1.5
-```
-
-**Flat format** — colon-separated domain keys, each with `weight` and `repetition_factor`. This is what `olmix generate` produces:
-
-```yaml
-mix:
-  dclm:science_math_and_technology:
-    weight: 0.2
-    repetition_factor: 1.0
-  dclm:software_development:
-    weight: 0.5
-    repetition_factor: 1.0
-  wikipedia:
-    weight: 0.1
-    repetition_factor: 2.0
-```
+`olmix generate` samples a swarm of mixtures from a `GenerationConfig` YAML and writes each one out as a `LaunchConfig` file used to train the proxy models. A key capability of `olmix generate` is **mixture reuse**: freezing some relative topic weights across the swarm while the remaining domains are sampled freely. See [`configs/examples/generate/example.yaml`](configs/examples/generate/example.yaml) for a basic `GenerationConfig` and [`configs/examples/generate/partial_mixture_reuse.yaml`](configs/examples/generate/partial_mixture_reuse.yaml) for a partial mixture reuse example.

### Step 0: Compute priors (token counts)

-Before generating mixes, compute the token counts for your data sources. This scans S3 paths and outputs the `priors` block for your generation config:
+Before generating mixes, set the priors for the data paths in your config. There are two fields: `relative_sizes`, used as the Dirichlet prior, and `token_counts`, used to enforce repetition constraints on the swarm (by default, no data is repeated at the proxy-model scale). These priors can be set manually, or computed automatically with `olmix priors compute`, which sets them to the natural distribution and the actual sizes of the data paths:

```bash
-olmix priors compute --config configs/generations/example.yaml
+olmix priors compute --config configs/examples/generate/example.yaml
```

-This outputs a YAML block you can paste directly into your generation config:
+This scans S3 paths and outputs a `priors:` block to paste into your generation config:

```yaml
priors:
+  relative_sizes:
+    arxiv: 0.13859324268101414
+    dclm:education_and_jobs: 0.13466673502770904
+    dclm:science_math_and_technology: 0.5479947162541395
+    dclm:software_development: 0.1548063887874921
+    wikipedia: 0.023938917249645256
  token_counts:
    arxiv: 21377485731
    dclm:education_and_jobs: 20771836713
@@ -274,91 +211,106 @@ priors:
    wikipedia: 3692487830
```

-Copy the output into your generation config's `priors:` section. Use `--output priors.yaml` to write to a file instead. Results are cached in `cache/` for subsequent runs; use `--no-cache` to force a fresh scan.
-
### Step 1: Generate candidate mixtures

-Use `olmix generate` to sample mixture variants from a generation config. The `--base` flag provides a launch config template, and each variant is written as a self-contained launch config YAML file — ready to submit directly.
+Use `olmix generate` to sample mixture variants from a generation config. The `--base` flag provides a `LaunchConfig` template (infra, training, eval settings); each variant inherits from it and gets a unique sampled `mix` written into it.

```bash
olmix generate \
-  --config configs/generations/example.yaml \
-  --base configs/experiments/data_proportions/mix_baseline.yaml \
+  --config configs/examples/generate/example.yaml \
+  --base configs/examples/launch/data_proportions/mix_baseline.yaml \
  --output output/my_variants/
```

-This produces one YAML file per variant in the output directory:
+This produces one self-contained `LaunchConfig` YAML per variant:

```
output/my_variants/
  example-swarm-a1b2c3d4-0000.yaml
  example-swarm-a1b2c3d4-0001.yaml
-  example-swarm-a1b2c3d4-0002.yaml
-  example-swarm-a1b2c3d4-0003.yaml
+  ...
```

-Each variant file is a complete launch config with infra, training, data, eval, and the sampled mix:
+Inspect and edit these files before launching — this is where you have full control over what gets trained.

-```yaml
-name: example-swarm-a1b2c3d4-0000
-description: Data proportions experiment - balanced baseline mix
-infra:
-  budget: ai2/oe-base
-  cluster: ai2/jupiter
-  # ...
-training:
-  proxy_model_id: olmo3_14m
-  # ...
-data:
-  sources:
-    - name: dclm
-      topics:
-        - name: science_math_and_technology
-          paths:
-            - s3://...
-    - name: wikipedia
-      paths:
-        - s3://...
-eval:
-  tasks: { ... }
-mix:
-  dclm:science_math_and_technology:
-    weight: 0.55
-  wikipedia:
-    weight: 0.10
-group_id: a1b2c3d4
-```
+### Step 2: Launch a swarm

-Inspect and edit these files before launching — this is the point where you have full control over what gets trained.
+```bash
+olmix launch run --variants output/my_variants/
+```

-### Step 2: Preview training commands
+Submits one training job per variant. Each job trains a proxy model on its mixture and logs eval metrics to W&B under a shared group ID. Use `--dry-run` to generate metadata without launching.

-Renders the full OLMo training command for each variant. The `--variants` flag accepts a directory of configs or a single config file. Prints to stdout without launching anything.
+### Step 3: Export to CSV and fit

-```bash
-olmix launch preview --variants output/my_variants/ # directory
-olmix launch preview --variants configs/experiments/data_proportions/mix_heavy_code.yaml # single file
-```
+Once runs complete, export ratios and metrics to CSV files (e.g. from W&B), then fit using the workflow in [Part 1](#part-1-mixture-optimization-from-csv-data).

-### Step 3: Launch a swarm
+### GenerationConfig reference

-Submits one Beaker job per variant. Each job trains a proxy model on its mixture and logs eval metrics to W&B under a shared group ID. Launch metadata is saved in the variants directory.
+```yaml
+name: my-swarm

-```bash
-olmix launch run --variants output/my_variants/
-olmix launch run --variants configs/experiments/data_proportions/mix_heavy_code.yaml # single file
+data:        # What data sources exist and how they're organized
+priors:      # Natural token distribution at the leaf level (from olmix priors compute)
+swarm:       # Sampling parameters
+max_tokens:  # Token budget per proxy run
```

-Use `--dry-run` to generate the metadata JSON without launching any jobs.
+#### `data`

-### Step 4: Export to CSV and fit
+`data.sources` lists data pools in a hierarchy: **source → topic → quality**. Each source specifies exactly one of `paths` (flat source), `topics`, or `quality`.

-Once the swarm runs complete, export the ratios and metrics to CSV files (e.g. from W&B), then fit using the YAML config workflow described in [Part 1](#part-1-fitting-from-csv-data):
+An optional `weight` field can appear on any topic; it pins that topic's share within its source's allocation (pinned values within a source should sum to ~1.0). Anything without a `weight` is sampled from the Dirichlet and varies freely across runs. This is the **mixture reuse** pattern: freeze existing ratios and re-optimize only the affected domains. Note that *source* and *topic* are relative notions here; for example, the aggregated virtual domain `existing` below is a source, while `wikipedia` is a topic within it:

-```bash
-olmix fit --config configs/fits/my_config.yaml --output-dir output/my_fit
+```yaml
+data:
+  sources:
+    - name: existing
+      topics:
+        - name: dclm:science_math_and_technology
+          paths: [...]
+          weight: 0.55   # frozen from prior optimization
+        - name: dclm:software_development
+          paths: [...]
+          weight: 0.30   # frozen
+        - name: dclm:entertainment
+          paths: [...]
+          weight: 0.1    # frozen
+        - name: wikipedia
+          paths: [...]
+          weight: 0.05   # frozen
+    - name: stack-edu
+      topics:
+        - name: Python
+          paths: [...]   # free to vary
+        - name: Java
+          paths: [...]   # free to vary
```

+For this example, the domains to recompute are `existing`, `stack-edu:Python`, and `stack-edu:Java`.
+
+#### `priors`
+
+Must be at the **leaf level** (e.g. `dclm:science_math_and_technology`, not `dclm`). `relative_sizes` defines the Dirichlet prior center for free domains; `token_counts` enforces the repetition constraint (no domain sampled past `repetition_factor` × its available data).
+
+#### `swarm`
+
+| Field | Description | Default |
+|-------|-------------|---------|
+| `variants` | Number of mixture variants to generate | `1` |
+| `seed` | Random seed | `42` |
+| `min_strength` / `max_strength` | Dirichlet concentration range. Low = diverse/extreme mixes; high = mixes near the prior | `0.1` / `5.0` |
+| `min_source_strength` / `max_source_strength` | Override strength for source-level sampling | — |
+| `min_topic_strength` / `max_topic_strength` | Override strength for topic-level sampling | — |
+| `minimum_weight` | Domains below this are zeroed out | `0.002` |
+| `minimum_source_weight` / `minimum_topic_weight` | Override `minimum_weight` at source or topic level | — |
+| `nonzero_weight` | Domain keys that must be nonzero in every variant | — |
+| `manual_prior` | Override source-level Dirichlet prior, e.g. `{dclm: 0.75, stack-edu: 0.25}` | — |
+| `manual_topic_prior` | Override topic-level Dirichlet prior for specific keys (topic still sampled) | — |
+| `repetition_factor` | Max allowed data repetition per domain | `1.0` |
+| `enable_bound` | Enforce the repetition bound when sampling | `true` |
+| `existing_mix_file` | Pickle of prior swarm ratios; new samples too close are rejected | — |
+
## Development

```bash
Lines changed: 1 addition & 3 deletions
@@ -1,5 +1,4 @@
-# DCLM baseline fit config
-# Fits regression models against 127 DCLM swarm runs across 24 DataDelve domains
+# Config for fitting one regression model per family, corresponding to RQ5, Table 3 (middle row), Section 3.3.3 of Olmix: https://arxiv.org/abs/2602.12237

swarm:
  ratios: ratios.csv # https://huggingface.co/datasets/allenai/olmix/blob/main/dclm_swarm/ratios.csv
@@ -31,7 +30,6 @@ priors: # from https://huggingface.co/datasets/allenai/olmix/blob/main/dclm_swar
    sports_and_fitness: 0.03455368961865927
    transportation: 0.02686612455867175
    travel_and_tourism: 0.01667626901270792
-  total_tokens: 5_000_000_000_000
  token_counts:
    adult_content: 67_760_078_203
    art_and_design: 70_659_711_995
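The dropped `total_tokens` field was redundant: the README describes it as equal to the sum across `token_counts`, so it can always be recomputed. A quick illustrative sketch (not repo code), using two of the counts above:

```python
# Two of the per-domain token counts from the config above
token_counts = {
    "adult_content": 67_760_078_203,
    "art_and_design": 70_659_711_995,
}

# total_tokens is derivable from token_counts, which is why the
# explicit field was removed from the config
total_tokens = sum(token_counts.values())
assert total_tokens == 138_419_790_198
```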

configs/examples/fit/example.yaml

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+# Example config
+
+swarm:
+  ratios: ratios.csv # https://huggingface.co/datasets/allenai/olmix/blob/main/dclm_swarm/ratios.csv
+  metrics: metrics.csv # https://huggingface.co/datasets/allenai/olmix/blob/main/dclm_swarm/metrics.csv
+
+priors:
+  relative_sizes:
+    adult_content: 0.015
+    art_and_design: 0.035
+    crime_and_law: 0.030
+    education_and_jobs: 0.045
+    electronics_and_hardware: 0.025
+    entertainment: 0.080
+    fashion_and_beauty: 0.020
+    finance_and_business: 0.060
+    food_and_dining: 0.025
+    games: 0.030
+    health: 0.050
+    history_and_geography: 0.035
+    home_and_hobbies: 0.030
+    industrial: 0.015
+    literature: 0.040
+    politics: 0.045
+    religion: 0.020
+    science_math_and_technology: 0.070
+    social_life: 0.060
+    software: 0.050
+    software_development: 0.060
+    sports_and_fitness: 0.040
+    transportation: 0.020
+    travel_and_tourism: 0.030
+  token_counts:
+    adult_content: 45_000_000_000
+    art_and_design: 105_000_000_000
+    crime_and_law: 90_000_000_000
+    education_and_jobs: 135_000_000_000
+    electronics_and_hardware: 75_000_000_000
+    entertainment: 240_000_000_000
+    fashion_and_beauty: 60_000_000_000
+    finance_and_business: 180_000_000_000
+    food_and_dining: 75_000_000_000
+    games: 90_000_000_000
+    health: 150_000_000_000
+    history_and_geography: 105_000_000_000
+    home_and_hobbies: 90_000_000_000
+    industrial: 45_000_000_000
+    literature: 120_000_000_000
+    politics: 135_000_000_000
+    religion: 60_000_000_000
+    science_math_and_technology: 210_000_000_000
+    social_life: 180_000_000_000
+    software: 150_000_000_000
+    software_development: 180_000_000_000
+    sports_and_fitness: 120_000_000_000
+    transportation: 60_000_000_000
+    travel_and_tourism: 90_000_000_000
+
+regression:
+  type: log_linear
+  seed: 0
+  n_test: 0
+  train_split: 1.0
+  aggregate_task_families: true
+
+proposer:
+  type: exact
+  temperature: null
+  kl_reg: 0.1
+  fit_only: true
+  make_worst_mix: false
+
+constraints:
+  enabled: false
+  target_tokens: null
+  repetition_factor: 5.0
+
+filtering:
+  drop_metrics: []
+  obj_weights: {}
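A sanity check one might run on a config like the example above (an illustrative sketch, not part of olmix; the dict literals stand in for a few entries of the parsed `priors` section):

```python
# Stand-ins for a few entries of the parsed priors section
relative_sizes = {"adult_content": 0.015, "entertainment": 0.080, "health": 0.050}
token_counts = {
    "adult_content": 45_000_000_000,
    "entertainment": 240_000_000_000,
    "health": 150_000_000_000,
}

# Every domain needs both a prior weight and a token count, since
# token_counts drives the per-domain repetition constraint
assert set(relative_sizes) == set(token_counts)

# Each weight must be a non-negative fraction of the corpus
assert all(0.0 <= w <= 1.0 for w in relative_sizes.values())
```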
