
Commit 9d93078

[DEBUG]: update advantage annotation procedures
1 parent 4c94179 commit 9d93078

File tree: 9 files changed, +223 −467 lines


README.md

Lines changed: 15 additions & 19 deletions
@@ -272,49 +272,45 @@ For gradient-based optimization, dataset splitting, and all other methods, see t
 
 Stage Advantage decomposes long-horizon tasks into semantic stages and provides stage-aware advantage signals for policy training. It addresses the numerical instability of prior non-stage approaches by computing advantage as progress differentials within each stage, yielding smoother and more stable supervision.
 
-The full pipeline has four stages:
+The full pipeline has five steps:
 
 ```
-Stage 0: GT Labeling → Stage 1: Train Advantage Estimator → Stage 2: Advantage Estimation → Stage 3: AWBC Training
+Step 0: Annotate stage_progress_gt (manual) → Step 1: Train Advantage Estimator → Step 2: Predict Advantage → Step 3: Discretize Advantage → Step 4: AWBC Training
 ```
 
 ### Quick Start
 
-**Stage 0 — GT Data Labeling**: Compute advantage values and discretize into `task_index` labels.
+**Step 0 — Annotate `stage_progress_gt`** (manual, no code provided): For each episode, annotate start/end timestamps and subtask split points, then compute per-frame `stage_progress_gt` (linear progress 0→1 within each subtask) and write it into the parquet files.
+
+**Step 1 — Train Advantage Estimator**: Fine-tune a pi0-based model to predict advantage from observations.
 
 ```bash
-cd stage_advantage/annotation
-python gt_label.py <dataset_path> \
-    --threshold 30 --chunk-size 50 --discretion-type binary \
-    --advantage-source absolute_advantage
+uv run python scripts/train_pytorch.py ADVANTAGE_TORCH_KAI0_FLATTEN_FOLD --exp_name=run1 --save_interval 10000
 ```
 
-For batch labeling across multiple dataset variants, see `stage_advantage/annotation/gt_labeling.sh`.
-
-**Stage 1 — Train Advantage Estimator**: Fine-tune a pi0-based model to predict advantage from observations.
+**Step 2 — Predict Advantage**: Use the trained estimator to label datasets with `absolute_advantage` and `relative_advantage`.
 
 ```bash
-uv run python scripts/train_pytorch.py ADVANTAGE_TORCH_KAI0_FLATTEN_FOLD --exp_name=run1 --save_interval 10000
+uv run python stage_advantage/annotation/eval.py Task-A KAI0 /path/to/dataset
 ```
 
-For a ready-to-use script with environment setup (conda/venv activation, DDP configuration) and automatic log management, see `stage_advantage/annotation/train_estimator.sh`.
-
-**Stage 2 — Advantage Estimation on New Data**: Use the trained estimator to label datasets with predicted advantage values.
+**Step 3 — Discretize Advantage**: Bin predicted advantages into positive/negative `task_index` labels.
 
 ```bash
-uv run python stage_advantage/annotation/eval.py Task-A KAI0 /path/to/dataset
+cd stage_advantage/annotation
+python discretize_advantage.py <dataset_path> \
+    --threshold 30 --chunk-size 50 --discretion-type binary \
+    --advantage-source absolute_advantage
 ```
 
-For a ready-to-use script with environment setup and status logging, see `stage_advantage/annotation/eval.sh`.
+For batch labeling across PI06/KAI0 variants, see `stage_advantage/annotation/discretize_advantage.sh`.
 
-**Stage 3 — AWBC Training**: Train a policy with Advantage-Weighted Behavior Cloning.
+**Step 4 — AWBC Training**: Train a policy with Advantage-Weighted Behavior Cloning.
 
 ```bash
 XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_flatten_fold_awbc --exp_name=run1
 ```
 
-For a ready-to-use script with environment setup and automatic log management, see `stage_advantage/awbc/train_awbc.sh`.
-
 For the full pipeline details, configuration instructions, and all parameters, see [`stage_advantage/README.md`](stage_advantage/README.md).
 
 ## Train-Deploy Alignment
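The binary discretization that Step 3 describes (top 30% of advantages → positive label) can be sketched as follows. This is a minimal illustration, not the repository's `discretize_advantage.py`; it assumes predicted advantages are already loaded into a NumPy array, and mirrors the `--threshold 30 --discretion-type binary` setting shown above.

```python
import numpy as np

def binary_task_index(advantages: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Label the top `threshold`% of frames by advantage as task_index=1
    and the rest as task_index=0 (the binary mode described above)."""
    # Frames at or above the (100 - threshold)th percentile fall in the
    # top threshold% of the advantage distribution.
    cutoff = np.percentile(advantages, 100.0 - threshold)
    return (advantages >= cutoff).astype(np.int64)

adv = np.array([0.1, 0.9, 0.2, 0.8, 0.5, 0.3, 0.7, 0.4, 0.6, 0.0])
labels = binary_task_index(adv, threshold=30.0)  # three frames labeled 1
```

With `--discretion-type n_slices`, the same idea generalizes to percentile boundaries at 100/n increments rather than a single cutoff.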

stage_advantage/README.md

Lines changed: 134 additions & 152 deletions
Large diffs are not rendered by default.
Lines changed: 14 additions & 19 deletions
@@ -1,36 +1,31 @@
-## Annotation: Stage 0–2 (Labeling, Estimator Training, Eval)
+## Annotation: Steps 1–3 (Estimator Training, Eval, Discretize)
 
-This directory contains **Stage 0** (GT labeling with `gt_label.py` / `gt_labeling.sh`), **Stage 1** (advantage estimator training via `scripts/train_pytorch.py`), and **Stage 2** (advantage estimation on new data via `eval.py`). All commands below assume you are at the **repository root** unless noted. Full pipeline and options are in the [parent README](../README.md).
+This directory contains **Step 1** (advantage estimator training via `scripts/train_pytorch.py`), **Step 2** (advantage prediction on data via `eval.py`), and **Step 3** (discretize advantages into positive/negative via `discretize_advantage.py`). All commands below assume you are at the **repository root** unless noted. Full pipeline and options are in the [parent README](../README.md).
 
 ### Quick Start
 
 ```bash
-# Step 1: Label a dataset with advantage-based task_index (GT labels from progress)
-# Edit DATA_PATH in gt_labeling.sh, then from repo root:
-bash stage_advantage/annotation/gt_labeling.sh
-
-# Step 2: Train the Advantage Estimator (update config.py repo_id / pytorch_weight_path first)
-# From repo root:
+# Step 1: Train the Advantage Estimator (update config.py repo_id / pytorch_weight_path first)
 uv run python scripts/train_pytorch.py ADVANTAGE_TORCH_KAI0_FLATTEN_FOLD --exp_name=run1 --save_interval 10000
-# Or: uv run python scripts/train_pytorch.py ADVANTAGE_TORCH_PI06_FLATTEN_FOLD --exp_name=run1 --save_interval 10000
 
-# Step 3: Evaluate the trained estimator on new data (PI06 or KAI0)
-# From repo root:
+# Step 2: Predict advantages on a dataset (update MODELS_CONFIG_MAP in eval.py first)
 uv run python stage_advantage/annotation/eval.py Task-A KAI0 /path/to/dataset
 
-# Step 4: Use the advantage-labeled data for AWBC (Stage 3)
-# After Stage 2, run gt_labeling.sh with DATA_PATH = eval repo (or gt_label.py --advantage-source absolute_advantage).
-# Then from repo root:
+# Step 3: Discretize advantages into positive/negative task_index labels
+# Edit DATA_PATH in discretize_advantage.sh, then:
+bash stage_advantage/annotation/discretize_advantage.sh
+
+# Step 4: AWBC training (see awbc/README.md)
 XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_flatten_fold_awbc --exp_name=run1
 ```
 
 ### File Descriptions
 
-| File | Stage | Description |
+| File | Step | Description |
 |---|---|---|
-| `gt_label.py` | 0 | Core script: computes advantage from progress/absolute_advantage and assigns `task_index` to parquet frames |
-| `gt_labeling.sh` | 0 | Batch labeling: prepares dataset dirs and runs `gt_label.py` (only .sh in this dir) |
-| `eval.py` | 2 | Evaluates a trained estimator on a dataset, writing predicted advantages to new parquets |
+| `discretize_advantage.py` | 3 | Reads advantage columns, bins into positive/negative `task_index`, writes `meta/tasks.jsonl` |
+| `discretize_advantage.sh` | 3 | Batch wrapper: prepares dataset dirs and runs `discretize_advantage.py` for PI06/KAI0 variants |
+| `eval.py` | 2 | Predicts advantage values on a dataset using a trained estimator |
 | `evaluator.py` | 2 | `SimpleValueEvaluator`: batched GPU inference with parallel video loading and prefetching |
 
-For Stage 0 parameters, Stage 1 config fields, Stage 2 `MODELS_CONFIG_MAP`, and end-to-end AWBC order, see the [parent README](../README.md).
+Step 1 training commands and Step 0 (manual annotation) are documented in the [parent README](../README.md).
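Step 0 ships no code in this commit. Purely as an illustration of the per-frame quantity it asks for (linear 0→1 progress within each annotated subtask), here is a sketch; the function name and frame-index split points are hypothetical, since the actual annotation is manual and works from timestamps.

```python
import numpy as np

def stage_progress_gt(n_frames: int, split_points: list[int]) -> np.ndarray:
    """Per-frame progress ramping linearly 0 -> 1 inside each subtask.

    split_points are the frame indices where a new subtask begins;
    episode start and end are implicit boundaries.
    """
    boundaries = [0] + sorted(split_points) + [n_frames]
    progress = np.zeros(n_frames, dtype=np.float32)
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        length = end - start
        if length > 1:
            # First frame of the subtask gets 0.0, last frame gets 1.0
            progress[start:end] = np.linspace(0.0, 1.0, length)
        else:
            progress[start:end] = 1.0
    return progress

# Example: a 10-frame episode with one subtask split at frame 4
p = stage_progress_gt(10, [4])
```

The resulting column would then be written back into the episode's parquet file alongside the existing frame data.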

stage_advantage/annotation/gt_label.py renamed to stage_advantage/annotation/discretize_advantage.py

Lines changed: 28 additions & 39 deletions
@@ -1,24 +1,24 @@
 #!/usr/bin/env python3
 """
-# python label.py <dataset_path> --threshold 30 --chunk-size 50 --discretion-type binary --advantage-source absolute_advantage --stage-nums 2 --dry-run
-Script to modify task_index in parquet files based on progress rewards.
+# python discretize_advantage.py <dataset_path> --threshold 30 --chunk-size 50 --discretion-type binary --advantage-source absolute_advantage --stage-nums 2 --dry-run
+Script to modify task_index in parquet files based on predicted advantage values.
 
 This script:
 1. Reads all parquet files from path/data/chunk-*/*.parquet
-2. Calculates reward as: progress[i+50] - progress[i] for each frame
-3. Computes reward distribution statistics across all parquets
-4. Labels frames with task_index based on reward percentile threshold
+2. Reads per-frame advantage from the specified source column (absolute_advantage or relative_advantage)
+3. Computes advantage distribution statistics across all parquets
+4. Labels frames with task_index based on advantage percentile threshold
 Binary mode:
-- task_index=0 for rewards in bottom (1-threshold)%
-- task_index=1 for rewards in top threshold%
+- task_index=0 for advantages in bottom (1-threshold)%
+- task_index=1 for advantages in top threshold%
 n_slices mode:
-- task_index=0 to (n-1) based on reward percentiles (higher reward -> higher task_index)
+- task_index=0 to (n-1) based on advantage percentiles (higher advantage -> higher task_index)
 - Each slice contains ~(100/n)% of frames
 
 Stage-based mode (--stage-nums > 1):
 - Each frame is assigned to a stage based on its stage_progress_gt value
 - Frames with stage_progress_gt in [i/stage_nums, (i+1)/stage_nums) belong to stage i
-- Each stage has its own reward statistics and percentile boundaries
+- Each stage has its own advantage statistics and percentile boundaries
 - task_index is assigned based on stage-specific percentiles
 """
 
@@ -35,38 +35,26 @@
 from tqdm import tqdm
 
 
-def calculate_rewards(data: pd.DataFrame, chunk_size: int = 50, advantage_source: str = "progress") -> np.ndarray:
+def calculate_rewards(data: pd.DataFrame, chunk_size: int = 50, advantage_source: str = "absolute_advantage") -> np.ndarray:
     """
-    Calculate rewards based on progress differences.
+    Read per-frame advantage values from the specified source column.
 
     Args:
-        data: DataFrame containing 'progress' column
-        chunk_size: Number of frames to look ahead for progress calculation
+        data: DataFrame containing the advantage column
+        chunk_size: Not used (kept for API compatibility)
+        advantage_source: Column name — "absolute_advantage" or "relative_advantage"
 
     Returns:
-        Array of rewards for each frame
+        Array of advantage values for each frame
     """
     n_frames = len(data)
-    rewards = np.zeros(n_frames, dtype=np.float32)
     if advantage_source == "absolute_advantage":
-        absolute_advantage = data['absolute_advantage'].values
-        for i in range(n_frames):
-            rewards[i] = absolute_advantage[i]
+        return data['absolute_advantage'].values.astype(np.float32)
     elif advantage_source == "relative_advantage":
-        relative_advantage = data['relative_advantage'].values
-        for i in range(n_frames):
-            rewards[i] = relative_advantage[i]
-    elif advantage_source == "progress":
-        progress = data['progress'].values
-        for i in range(n_frames):
-            if i + chunk_size < n_frames:
-                rewards[i] = progress[i + chunk_size] - progress[i]
-            else:
-                # For frames near the end, use the last available frame
-                rewards[i] = (progress[-1] - progress[i]) / (len(progress) - i) * chunk_size
+        return data['relative_advantage'].values.astype(np.float32)
     else:
-        raise ValueError(f"Unknown advantage source: {advantage_source}")
-    return rewards
+        raise ValueError(f"Unknown advantage source: {advantage_source}. "
+                         f"Must be 'absolute_advantage' or 'relative_advantage'.")
 
 
 def get_stage_index(stage_progress_gt: float, stage_nums: int) -> int:
@@ -91,7 +79,7 @@ def get_stage_index(stage_progress_gt: float, stage_nums: int) -> int:
     return stage_idx
 
 
-def collect_all_rewards(base_path: str, chunk_size: int = 50, advantage_source: str = "progress",
+def collect_all_rewards(base_path: str, chunk_size: int = 50, advantage_source: str = "absolute_advantage",
                         stage_nums: int = 1) -> Tuple[Dict[int, List[float]], List[str]]:
     """
     Collect all rewards from all parquet files to compute statistics.
@@ -223,9 +211,9 @@ def update_tasks_jsonl(base_path: str, discretion_type: str, n_slices: int = 10)
 def assign_task_index(parquet_file: str, threshold_percentile: float,
                       chunk_size: int = 50, discretion_type: str = "binary",
                       percentile_boundaries: List[float] = None, n_slices: int = 10,
-                      advantage_source: str = "progress") -> None:
+                      advantage_source: str = "absolute_advantage") -> None:
     """
-    Assign task_index to frames in a parquet file based on reward threshold.
+    Assign task_index to frames in a parquet file based on advantage threshold.
     (Used when stage_nums=1)
 
     Args:
@@ -269,7 +257,7 @@ def assign_task_index_staged(parquet_file: str,
                              chunk_size: int = 50,
                              discretion_type: str = "binary",
                              n_slices: int = 10,
-                             advantage_source: str = "progress",
+                             advantage_source: str = "absolute_advantage",
                              stage_nums: int = 1) -> None:
     """
     Assign task_index to frames in a parquet file based on stage-specific thresholds.
@@ -330,7 +318,7 @@ def assign_task_index_staged(parquet_file: str,
 
 def main():
     parser = argparse.ArgumentParser(
-        description="Modify task_index in parquet files based on progress rewards"
+        description="Discretize predicted advantage values into task_index labels"
     )
     parser.add_argument(
         "data_path",
@@ -365,8 +353,9 @@ def main():
     parser.add_argument(
         "--advantage-source",
         type=str,
-        default="progress",
-        choices=["progress", "absolute_advantage", "relative_advantage"]
+        default="absolute_advantage",
+        choices=["absolute_advantage", "relative_advantage"],
+        help="Which predicted advantage column to use (default: absolute_advantage)"
     )
     parser.add_argument(
         "--stage-nums",
@@ -396,7 +385,7 @@ def main():
         print(f"Threshold: {args.threshold}% (top {args.threshold}% will be task_index=1)")
     elif args.discretion_type == "n_slices":
         print(f"Number of slices: {args.n_slices}")
-        print(f"Progress offset: {args.chunk_size} frames")
+        print(f"Chunk size: {args.chunk_size} frames")
     print(f"Stage nums: {args.stage_nums}")
     if args.stage_nums > 1:
         step = 1.0 / args.stage_nums
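The half-open intervals [i/stage_nums, (i+1)/stage_nums) in the docstring above imply a simple stage lookup. The sketch below is consistent with that description, though the actual `get_stage_index` in the file may differ in details such as how `stage_progress_gt == 1.0` is handled (here it is clipped into the last stage).

```python
def get_stage_index(stage_progress_gt: float, stage_nums: int) -> int:
    """Map stage_progress_gt in [0, 1] to a stage index.

    Frames with stage_progress_gt in [i/stage_nums, (i+1)/stage_nums)
    belong to stage i; exactly 1.0 falls into the last stage.
    """
    stage_idx = int(stage_progress_gt * stage_nums)
    return min(stage_idx, stage_nums - 1)
```

With `stage_nums=2`, progress 0.49 maps to stage 0 and 0.5 to stage 1, so each stage's advantage percentiles are computed over a disjoint set of frames.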

stage_advantage/annotation/gt_labeling.sh renamed to stage_advantage/annotation/discretize_advantage.sh

Lines changed: 9 additions & 7 deletions
@@ -1,6 +1,8 @@
 #!/bin/bash
 ###############################################################################
-# Prepare advantage-labeled datasets for training the Advantage Estimator.
+# Discretize predicted advantages into positive/negative task_index labels
+# for AWBC training. Run this AFTER Stage 2 (eval.py) has produced
+# data_PI06_*/data_KAI0_* subdirs with advantage columns.
 ###############################################################################
 set -xe
 set -o pipefail
@@ -18,7 +20,7 @@ dir_name=$(dirname "$DATA_PATH")/${base_name}_advantage_data
 prepare_and_label() {
     local data_subdir=$1   # source data subfolder name (e.g. data_PI06_100000 or data_KAI0_100000)
     local output_name=$2   # output dataset name suffix
-    local extra_args=$3    # extra arguments for gt_label.py
+    local extra_args=$3    # extra arguments for discretize_advantage.py
     local target_path="${dir_name}/${output_name}"
 
     echo "============================================================"
@@ -32,7 +34,7 @@ prepare_and_label() {
     # Symlink videos (shared, read-only)
     ln -sfn "${DATA_PATH}/videos" "${target_path}/videos"
 
-    # Copy norm_stats and meta (will be modified by gt_label.py)
+    # Copy norm_stats and meta (will be modified by discretize_advantage.py)
     cp -f "${DATA_PATH}/norm_stats.json" "${target_path}/norm_stats.json"
     cp -rf "${DATA_PATH}/meta" "${target_path}/meta"
 
@@ -42,8 +44,8 @@ prepare_and_label() {
     fi
     cp -r "${DATA_PATH}/${data_subdir}" "${target_path}/data"
 
-    # Run gt_label.py to assign task_index and update tasks.jsonl
-    python "${SCRIPT_DIR}/gt_label.py" "${target_path}" \
+    # Run discretize_advantage.py to assign task_index and update tasks.jsonl
+    python "${SCRIPT_DIR}/discretize_advantage.py" "${target_path}" \
         --threshold 30 \
         --chunk-size 50 \
         --discretion-type binary \
@@ -67,6 +69,6 @@ echo " All datasets labeled successfully!"
 echo ""
 echo " Output directory: ${dir_name}"
 echo ""
-echo " Next step: set repo_id in config.py to the target dataset path,"
-echo " then run: uv run python scripts/train_pytorch.py ADVANTAGE_TORCH_* --exp_name=run1 --save_interval 10000"
+echo " Next step: set repo_id in AWBC config to the target dataset path,"
+echo " then run: XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_*_awbc --exp_name=run1"
 echo "============================================================"

stage_advantage/annotation/eval.sh

Lines changed: 0 additions & 70 deletions
This file was deleted.
