docs/ae.md
- [ ] Environment set up (Python, dependencies, 2 CUDA GPUs with ≥ 12 GiB memory each)
- [ ] (*Optional*) Downloaded pre-collected / pre-computed data
- [ ] Ran **[Silent Issue Detection](#eval-silent-issue-detection)** experiment
- [ ] Ran **[Invariant Transferability](#eval-transferability)** evaluation
- [ ] Ran **[False Positive Rate](#false-positive-rate)** evaluation
- [ ] Ran **[Performance Overhead](#eval-performance-overhead)** measurement
- [ ] Verified outputs match expected results (tolerances noted per experiment)
This artifact allows you to reproduce the 4 major evaluation results presented in the paper.

- [ ] Ran **[Silent Issue Detection (Section 5.1 and 5.2)](#eval-silent-issue-detection)** experiment
- [ ] Ran **[Invariant Transferability (Section 5.3)](#eval-transferability)** evaluation
- [ ] Ran **[False Positive Rate (Section 5.4)](#false-positive-rate)** evaluation
- [ ] Ran **[Performance Overhead (Section 5.5)](#eval-performance-overhead)** measurement

### ⏱️ Recommended Evaluation Order
## Eval: False Positive Rate

⏳ Estimated Completion Time: TBD hour.

- Trace Collection: x hours
- Invariant Inference: x hours
- Invariant Checking: x hours

### 🎯 Goal

This evaluation measures the false positive rate of alarms from TrainCheck's invariants.
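As a rough illustration of the metric (a hypothetical sketch, not TrainCheck's actual tooling or API): on known-correct runs every alarm is a false positive, so the rate can be summarized as the fraction of checked invariants that raise at least one alarm.

```python
# Hypothetical illustration of a false-positive-rate summary
# (invented helper, not part of TrainCheck): on known-correct
# workloads, any invariant that fires is a false positive.

def false_positive_rate(alarms_per_invariant: dict[str, int]) -> float:
    """Map each checked invariant to its alarm count; return the
    fraction of invariants that fired at least once."""
    if not alarms_per_invariant:
        return 0.0
    fired = sum(1 for count in alarms_per_invariant.values() if count > 0)
    return fired / len(alarms_per_invariant)

# 1 of 4 invariants fired on a correct run -> 25% false positive rate.
print(false_positive_rate({"inv_a": 0, "inv_b": 2, "inv_c": 0, "inv_d": 0}))  # 0.25
```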
### 📂 Resources & Scripts

- Automation Scripts:
  1. TBD
  2. TBD
  3. TBD
- Workloads: PyTorch official pipelines, accessible at TBD FP WORKLOAD

### 🛠 How to Run

xxx
## Eval: Transferability

⏳ Estimated Completion Time: TBD hour.

- Trace Collection: x hours
- Invariant Inference: x hours
- Invariant Checking: x hours

### 🎯 Goal

This evaluation measures the transferability of invariants inferred by TrainCheck.
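To make the notion concrete, here is a hypothetical sketch (an invented representation, not TrainCheck's actual data model): treat each invariant as a predicate over trace events, and count how many invariants inferred on one pipeline still hold on a trace collected from a different pipeline.

```python
# Hypothetical sketch of a transferability measure (simplified,
# invented representation): an invariant "transfers" if it holds
# on every event of a trace from a different pipeline.

def transfer_rate(invariants, target_trace) -> float:
    """invariants: predicates over single trace events;
    target_trace: list of event dicts from the target pipeline."""
    if not invariants:
        return 0.0
    held = sum(1 for inv in invariants if all(inv(e) for e in target_trace))
    return held / len(invariants)

# Toy target trace and two invariants inferred elsewhere:
trace = [{"loss_finite": True, "lr": 0.1}, {"loss_finite": True, "lr": 0.01}]
invs = [lambda e: e["loss_finite"], lambda e: e["lr"] > 0.05]
print(transfer_rate(invs, trace))  # 0.5 (the lr bound does not transfer)
```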
## Eval: Performance Overhead

⏳ Estimated Completion Time: 1.5 hours.

### 🎯 Goal

This evaluation measures the runtime overhead introduced by TrainCheck’s instrumentation compared to un-instrumented runs across a set of representative ML workloads, during the invariant checking stage. The results correspond to Section 5.5 of the paper.

### 📂 Resources & Scripts

- Automation Scripts:
  - `eval_scripts/perf_benchmark/run_all.xsh`: run the experiments and collect data.
  - `eval_scripts/perf_benchmark/analysis.xsh`: analyze raw data and produce input for the plot scripts.
  - `eval_scripts/perf_benchmark/plot_e2e.py` and `eval_scripts/perf_benchmark/plot_micro.py`: plot the figures in Section 5.5.
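The overhead metric itself reduces to a simple ratio. The sketch below is an assumption about how such numbers are typically reported (percent slowdown over the un-instrumented baseline), not code taken from `analysis.xsh`:

```python
# Hypothetical overhead computation (assumed convention, not taken
# from analysis.xsh): relative slowdown of an instrumented run
# versus its un-instrumented baseline.

def relative_overhead(instrumented_s: float, baseline_s: float) -> float:
    """Return overhead as a fraction, e.g. 0.05 for a 5% slowdown."""
    return (instrumented_s - baseline_s) / baseline_s

# A 63 s instrumented run against a 60 s baseline:
print(f"{relative_overhead(63.0, 60.0):.1%}")  # 5.0%
```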