Commit f115a3c
Merge pull request #28 from TRI-ML/update_readme
Update README.md and Incorporate Quickstart Resources
2 parents ba941ed + 8dd252d commit f115a3c

File tree: 8 files changed, +774 -4 lines

README.md

Lines changed: 158 additions & 2 deletions
# sequentialized_barnard_tests

A collection of sequential statistical hypothesis testing methods for two-by-two contingency tables.

[[Paper](https://www.arxiv.org/abs/2503.10966)]  [[Website](https://tri-ml.github.io/step/)]

<!-- ## Development Plan
This codebase will be developed into a standalone pip package, with planned release date June 2025.

Current features:
- Fully automated STEP policy synthesis
- Baseline sequential method implementations of the [SAVI](https://www.sciencedirect.com/science/article/pii/S0167715223000597?via%3Dihub) and [Lai](https://projecteuclid.org/journals/annals-of-statistics/volume-16/issue-2/Nearly-Optimal-Sequential-Tests-of-Composite-Hypotheses/10.1214/aos/1176350840.full) procedures
- Validation tools
  - STEP policy visualization
  - Verification of Type-1 error control
- Unit tests
  - Method functionality
  - Recreation of results from our [paper](https://www.arxiv.org/abs/2503.10966)

Features in development:
- Demonstration scripts for each stage of the STEP evaluation pipeline
- Approximately optimal risk budget estimation tool based on evaluator priors on $$(p_0, p_1)$$
- Fundamental limit estimator for guiding evaluation effort
  - Determines whether a particular effect size is plausibly discoverable given the evaluation budget
- Baseline implementation of a sequential Barnard procedure which controls Type-1 error -->

## Installation Instructions \[Standard\]
The basic environment setup is shown below. A virtual / conda environment may be used; however, the requirements are quite lightweight, so this is probably not necessary.
```bash
$ cd <some_directory>
$ git clone git@github.com:TRI-ML/sequentialized_barnard_tests.git
$ cd sequentialized_barnard_tests
$ pip install -r requirements.txt
$ pip install -e .
```

## Installation Instructions \[Dev\]
For potential contributors and developers, we recommend a virtualenv:
```bash
$ cd <some_directory>
$ git clone git@github.com:TRI-ML/sequentialized_barnard_tests.git
$ cd sequentialized_barnard_tests
$ pip install -r requirements.txt
$ pip install -e .
$ pre-commit install
```

We assume that any specified virtual / conda environment has been activated for all subsequent code snippets.

# Quick Start Guides
We include key notes for understanding the core ideas of the STEP code. Quick-start resources are provided in both shell-script and notebook form.

## Quick Start Guide: Making a STEP Policy for Specific \{n_max, alpha\}

### (1A) Understanding the Accepted Shape Parameters
To synthesize a STEP policy for specific values of n_max and alpha, one additional set of parametric decisions is required: the user must set the risk budget shape, which is specified by a choice of function family (p-norm vs. zeta function) and a shape parameter. The shape parameter is real-valued; it is used directly for the zeta-function family and is exponentiated for p-norms.

**For p-norms**
- $$\text{Shape Parameter: } \lambda \in \mathbb{R}$$
- $$\text{Accumulated Risk Budget}(n) = \alpha \cdot \left(\frac{n}{n_{max}}\right)^{\exp(\lambda)}$$

**For the zeta function**
- $$\text{Shape Parameter: } \lambda \in \mathbb{R}$$
- $$\text{Accumulated Risk Budget}(n) = \frac{\alpha}{Z(n_{max})} \cdot \sum_{i=1}^n \left(\frac{1}{i}\right)^{\lambda}$$
- $$Z(n_{max}) = \sum_{i=1}^{n_{max}} \left(\frac{1}{i}\right)^{\lambda}$$

The user may confirm that in each case, evaluating the accumulated risk budget at $`n = n_{max}`$ returns precisely $\alpha$.
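
As a concrete check, here is a minimal Python sketch of both budget shapes. The helper `accumulated_risk_budget` is hypothetical, written only to mirror the formulas above; it is not part of the package API.

```python
import numpy as np

def accumulated_risk_budget(n, n_max, alpha, shape_parameter=0.0, use_p_norm=False):
    """Illustrative helper mirroring the two budget families above.

    p-norm family:  alpha * (n / n_max) ** exp(shape_parameter)
    zeta family:    alpha * sum_{i<=n} i**(-shape_parameter) / Z(n_max)
    """
    if use_p_norm:
        return alpha * (n / n_max) ** np.exp(shape_parameter)
    i = np.arange(1, n_max + 1, dtype=float)
    partial_sums = np.cumsum(i ** (-shape_parameter))
    return alpha * partial_sums[n - 1] / partial_sums[-1]

# Sanity check: both families spend exactly alpha at n = n_max.
for use_p_norm in (True, False):
    assert np.isclose(accumulated_risk_budget(200, 200, 0.05, 1.0, use_p_norm), 0.05)
```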

### (1B) Shape Parameters in Code
In the codebase, the value of $\lambda$ is set via the \{shape_parameter\} variable, which is real-valued.

The shape family is set via the \{use_p_norm\} variable, which is Boolean:
- If it is True, the p-norm family is used.
- If it is False, the zeta-function family is used.
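
For reference, these two variables appear as keyword arguments when a STEP test object is constructed, as in the call made by `evaluation/run_step_on_evaluation_data.py`; the parameter values here are illustrative:

```python
from sequentialized_barnard_tests.base import Hypothesis
from sequentialized_barnard_tests.step import MirroredStepTest

# Linear risk budget: shape_parameter = 0.0 with the zeta-function family.
test = MirroredStepTest(
    alternative=Hypothesis.P0LessThanP1,  # alternative hypothesis p0 < p1
    n_max=200,
    alpha=0.05,
    shape_parameter=0.0,  # lambda, real-valued
    use_p_norm=False,     # False selects the zeta-function family
)
```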

### (1C) Arbitrary Risk Budgets
Generalizing the accepted risk budgets to arbitrary monotonic sequences $`\{0, \epsilon_1 > 0, \epsilon_2 > \epsilon_1, ..., \epsilon_{n_{max}} = \alpha\}`$ is in the development pipeline, but is **not handled at present in the code**.

### (2A) Running STEP Policy Synthesis
Having decided on an appropriate form for the risk budget shape, policy synthesis is straightforward to run. From the base directory, the general command is:

```bash
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {shape_parameter} -up {use_p_norm}
```

### (2B) What If I Don't Know the Right Risk Budget?
We recommend using the default linear risk budget, which is the shape *used in the paper*. This corresponds to \{shape_parameter\}$`= 0.0`$ for each shape family. Thus, *each of the following commands constructs the same policy*:

```bash
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha}
```
```bash
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {0.0} -up "True"
```
```bash
$ python scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha} -pz {0.0} -up "False"
```
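
To see why these coincide, set $\lambda = 0$ in the two formulas from (1A): the p-norm exponent becomes $\exp(0) = 1$, and every term of the zeta partial sum becomes $1$, so

$$\alpha \cdot \left(\frac{n}{n_{max}}\right)^{\exp(0)} = \alpha \cdot \frac{n}{n_{max}} \qquad \text{and} \qquad \frac{\alpha}{\sum_{i=1}^{n_{max}} 1} \cdot \sum_{i=1}^{n} 1 = \alpha \cdot \frac{n}{n_{max}}.$$

Both families therefore reduce to the same linear budget.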

Note: For \{shape_parameter\} $`\neq 0`$, the shape families differ. Therefore, the choice of \{use_p_norm\} *will affect the STEP policy*.

### (2C) Disclaimers
- Running the policy synthesis will save a durable policy to the user's local machine. This policy can be reused for all future settings requiring the same \{n_max, alpha\} combination. For \{n_max\} $`< 500`$, the required memory is under 5 MB. The policy is saved under:
```bash
$ sequentialized_barnard_tests/policies/
```

- At present, we have not tested extensively beyond \{n_max\}$`= 500`$. Going beyond this limit may lead to issues, and the likelihood of problems grows with \{n_max\}. The code also requires increasing amounts of RAM as \{n_max\} is increased.
## Quick Start Guide: Evaluation on Real Data
113+
114+
We now assume that a STEP policy has been constructed for the target problem. This can either be one of the default policies, or a newly constructed one following the recipe in the preceding section.
115+
116+
### (1) Formatting the Real Data
117+
The data should be formatted into a numpy array of shape $(N, 2)$. The user should create a new project directory and save the data within it:
118+
```bash
119+
$ mkdir data/{new_project_dir}
120+
$ cp path/to/{my_data_file.npy} data/{new_project_dir}/{my_data_file.npy}
121+
```
122+
123+
We give an example that *would have generated* the included data:
124+
```bash
125+
$ mkdir data/example_clean_spill
126+
$ cp some/path/to/TRI_CLEAN_SPILL_v4.npy data/example_clean_spill/TRI_CLEAN_SPILL_v4.npy
127+
```
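
For data of your own, the file just needs to be a saved numpy array of shape $(N, 2)$. Below is a minimal sketch that writes a synthetic file of the right shape; the directory name, file name, and Bernoulli success rates are illustrative, not part of the repository:

```python
import os

import numpy as np

rng = np.random.default_rng(seed=0)
N = 50  # number of paired evaluation trials

# Column 0: policy-0 outcomes; column 1: policy-1 outcomes (1 = success, 0 = failure).
data = np.stack(
    [rng.binomial(1, 0.6, size=N), rng.binomial(1, 0.8, size=N)],
    axis=1,
).astype(float)

os.makedirs("data/my_synthetic_project", exist_ok=True)
np.save("data/my_synthetic_project/my_synthetic_data.npy", data)
print(data.shape)  # (50, 2)
```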

### (2) Running the Evaluation
The user then simply runs the evaluation script, which requires the project directory and file name in addition to the policy synthesis arguments:
```bash
$ python evaluation/run_step_on_evaluation_data.py -p "{new_project_dir}" -f "{my_data_file.npy}" -n {n_max} -a {alpha} -pz {shape_parameter} -up "{use_p_norm}"
```

This will print the evaluation result to the terminal and save key information in a timestamped json file.
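
The saved file mirrors the `evaluation_dict` assembled in `evaluation/run_step_on_evaluation_data.py`; the example result committed alongside the data has this form (values depend on the data and test parameters):

```json
{
  "result": [{"Hypothesis": "P0 < P1", "Decision": "P0 >= P1", "Time": 9}],
  "method": "STEP",
  "params": [{"n_max": 200, "alpha": 0.05}],
  "p0_hat": 0.8,
  "p1_hat": 0.28,
  "N": 50
}
```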

We illustrate this with an evaluation on the default data:
```bash
$ python evaluation/run_step_on_evaluation_data.py -p "example_clean_spill" -f "TRI_CLEAN_SPILL_v4.npy" -n 200 -a 0.05 -pz 0.0 -up "False"
```

# Code Overview
Scripts to generate and visualize STEP policies are included under:
```bash
$ scripts/
```

Any resulting visualizations are stored in:
```bash
$ media/
```

The evaluation environment for real data is included in:
```bash
$ evaluation/
```

and the associated evaluation data is stored in:
```bash
$ data/
```

# Citation

```
@inproceedings{snyder2025step,
  title     = {Is Your Imitation Learning Policy Better Than Mine? Policy Comparison with Near-Optimal Stopping},
  author    = {Snyder, David and Hancock, Asher James and Badithela, Apurva and Dixon, Emma and Miller, Patrick and Ambrus, Rares Andrei and Majumdar, Anirudha and Itkina, Masha and Nishimura, Haruki},
  booktitle = {Proceedings of the Robotics: Science and Systems Conference (RSS)},
  year      = {2025},
}
```
928 Bytes
Binary file not shown.
Lines changed: 1 addition & 0 deletions

{"result": [{"Hypothesis": "P0 < P1", "Decision": "P0 >= P1", "Time": 9}], "method": "STEP", "params": [{"n_max": 200, "alpha": 0.05}], "p0_hat": 0.8, "p1_hat": 0.28, "N": 50}
Lines changed: 195 additions & 0 deletions
import argparse
import json
import os
from datetime import datetime

import numpy as np

from sequentialized_barnard_tests.base import Decision, Hypothesis
from sequentialized_barnard_tests.step import MirroredStepTest

# from ..sequentialized_barnard_tests.lai import MirroredLaiTest
# from ..sequentialized_barnard_tests.savi import MirroredOracleSaviTest


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=(
            "This script runs a pre-computed STEP policy on real data. The data "
            "is assumed to be stored as '<base_path>/data/<project_folder>/<file_name>.npy'. "
            "Further, the data is assumed to be of shape (N, 2), corresponding to the success-"
            "failure values observed in evaluation trials. Input parameters determine the nature "
            "of the STEP test, certain save options, and the data project to access."
        )
    )
    parser.add_argument(
        "-p",
        "--data_folder",
        type=str,
        default="example_clean_spill",
        help=(
            "Relative path added to <base_path>/data/ which specifies the desired "
            "evaluation data folder, from which to run the STEP test. "
            "Defaults to 'example_clean_spill'."
        ),
    )
    parser.add_argument(
        "-f",
        "--data_file",
        type=str,
        default="TRI_CLEAN_SPILL_v4.npy",
        help=(
            "Relative path added to <base_path>/data/<project_folder>/ which specifies the "
            "desired evaluation data on which to run the STEP test. "
            "Defaults to 'TRI_CLEAN_SPILL_v4.npy'."
        ),
    )
    parser.add_argument(
        "-n",
        "--n_max",
        type=int,
        default=200,
        help=(
            "Maximum number of robot policy evals (per policy) in the evaluation procedure. "
            "Defaults to 200."
        ),
    )
    parser.add_argument(
        "-a",
        "--alpha",
        type=float,
        default=0.05,
        help=(
            "Maximal allowed Type-1 error rate of the statistical testing procedure. "
            "Defaults to 0.05."
        ),
    )
    parser.add_argument(
        "-pz",
        "--log_p_norm",
        type=float,
        default=0.0,
        help=(
            "Rate at which risk is accumulated, reflecting the user's belief about the underlying "
            "likelihood of different alternatives and nulls being true. If using a p_norm, "
            "this variable is equivalent to log(p). If not using a p_norm, this is the "
            "argument to the zeta function, partial sums of which give the shape of the risk budget. "
            "Defaults to 0.0."
        ),
    )
    parser.add_argument(
        "-up",
        "--use_p_norm",
        type=str,
        default="False",
        help=(
            "Toggle whether to use the p_norm or zeta function shape family for the risk budget. "
            "If True, uses the p_norm shape; else, uses the zeta function shape family. "
            "Defaults to False (zeta function partial sum family)."
        ),
    )
    # NOTE: argparse's type=bool treats any non-empty string as True, so "-so False"
    # still parses as True; pass an empty string ("") to disable. The same caveat
    # applies to --use_default_alternative below.
    parser.add_argument(
        "-so",
        "--save_output",
        type=bool,
        default=True,
        help=(
            "Toggle whether to save the evaluation output in a timestamped config file, as "
            "opposed to only printing to terminal. If True, the evaluation result is saved "
            "in a json file in the same directory as the data. Defaults to True."
        ),
    )
    parser.add_argument(
        "-uda",
        "--use_default_alternative",
        type=bool,
        default=True,
        help=(
            "Determines the alternative to use in the testing procedure. If True, uses the "
            "alternative p0 < p1. Otherwise, uses the alternative p0 > p1. Defaults to True."
        ),
    )

    # Parse the input arguments. use_p_norm arrives as a string and is converted
    # to a bool by inspecting its first character.
    args = parser.parse_args()
    if args.use_p_norm[0] == "F" or args.use_p_norm[0] == "f":
        args.use_p_norm = False
    elif args.use_p_norm[0] == "T" or args.use_p_norm[0] == "t":
        args.use_p_norm = True
    else:
        raise ValueError(
            "Invalid argument for use_p_norm; must be a string beginning with 'T' or 'F'"
        )

    # Define the base path so that the file can be run from any directory
    base_path = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))

    # Define the data folder path and data file path
    data_folder_path = os.path.join(base_path, "data", args.data_folder)
    data_file_path = os.path.join(data_folder_path, args.data_file)

    # Assign step test alternative hypothesis
    if args.use_default_alternative:
        alt = Hypothesis.P0LessThanP1
        alt_str = "P0 < P1"
        null_str = "P0 >= P1"
    else:
        alt = Hypothesis.P0MoreThanP1
        alt_str = "P0 > P1"
        null_str = "P0 <= P1"

    # Initialize the STEP test
    mirrored_step_test = MirroredStepTest(
        alternative=alt,
        n_max=args.n_max,
        alpha=args.alpha,
        shape_parameter=args.log_p_norm,
        use_p_norm=args.use_p_norm,
    )

    # Load data and evaluate
    data = np.load(data_file_path)

    result = mirrored_step_test.run_on_sequence(data[:, 0], data[:, 1])

    decision_str = "No decision"
    if result.decision == Decision.AcceptAlternative:
        decision_str = alt_str
    elif result.decision == Decision.AcceptNull:
        decision_str = null_str

    # Print to terminal
    print(
        f"Evaluation result for data stored at: {os.path.join(data_folder_path, args.data_file)}"
    )
    print(f"Decision: {result.decision} --> we conclude that {decision_str}")
    print(f"Time of decision: {result.info['Time']}")

    # Save to json file
    if args.save_output:
        now = datetime.now()
        formatted_time = now.strftime("%Y-%m-%d_%H:%M:%S")
        evaluation_dict = {
            "result": [
                {
                    "Hypothesis": alt_str,
                    "Decision": decision_str,
                    "Time": result.info["Time"],
                }
            ],
            "method": "STEP",
            "params": [
                {
                    "n_max": args.n_max,
                    "alpha": args.alpha,
                }
            ],
            "p0_hat": np.mean(data[:, 0]),
            "p1_hat": np.mean(data[:, 1]),
            "N": data.shape[0],
        }

        with open(
            os.path.join(data_folder_path, f"evaluation_result_{formatted_time}.json"), "w"
        ) as fp:
            json.dump(evaluation_dict, fp)
